1 First task: Linear Regression

1.1 Introduction and Motivation

Extreme flooding has become one of the most pressing environmental challenges affecting many countries worldwide. In late 2025, Sri Lanka faced one of the most severe flood emergencies in its recent history. As reported by BBC News on 30 November 2025, the government declared a national state of emergency after widespread floods and mudslides displaced thousands of citizens and caused significant infrastructure damage. The president described the crisis as the “most challenging natural disaster” the country has ever confronted. This event drew international attention and highlighted the growing vulnerability of many regions to climate-related hazards.

Motivated by this real-world context, this project focuses on studying flood risk in Sri Lanka using a rich, multidimensional dataset. The dataset contains 25,000 geo-referenced observations that combine environmental indicators (such as recent rainfall and vegetation indices), terrain characteristics (elevation, proximity to rivers), land-use descriptors, demographic density, infrastructure quality, and historical flood occurrences. Importantly, it provides a continuous flood risk score, which enables quantitative exploration of how geographical, environmental, and socio-economic factors jointly shape exposure to floods.

The originality of this project lies in connecting a current and socially relevant issue with a data-driven risk analysis. Although the dataset is simulated, it is constructed using realistic geospatial and environmental proxies. This makes it suitable for understanding which features are associated with flood-prone areas and for illustrating the usefulness of empirical data in the context of disaster preparedness. By focusing on a case that is both timely and meaningful, the project aims to provide a thoughtful, evidence-based perspective on a problem of growing global significance.

1.2 The dataset

The dataset used in this project comes from the Kaggle repository Sri Lanka Flood Risk and Inundation Dataset, which provides a synthetic but realistic geospatial representation of 25,000 locations across Sri Lanka. Each observation corresponds to a small spatial unit described by environmental, topographic, socio-demographic and infrastructural indicators. The table below summarises the definition of all variables as reported in the official Kaggle documentation.

Description of all dataset variables.
Column Type Description
record_id string Synthetic unique record id (e.g., F100000).
district string Administrative district name (Sri Lanka).
place_name string Synthetic town/village name.
latitude float WGS84 latitude.
longitude float WGS84 longitude.
elevation_m integer Elevation above sea level (meters).
distance_to_river_m float Distance to nearest river (meters).
landcover string Land cover type: Urban, Agriculture, Forest, Wetland, etc.
soil_type string Soil category (Clay, Sandy, Loamy, Peaty, Silty).
water_supply string Primary water source (Municipal, Well, Surface, etc.).
electricity string Electricity availability type.
road_quality string Road access / quality category.
population_density_per_km2 integer Population density (people per km²).
built_up_percent float Percent built-up area in the location.
urban_rural string Urban / Rural classification.
rainfall_7d_mm float Rainfall accumulated over the last 7 days (mm).
monthly_rainfall_mm float Monthly accumulated rainfall (mm).
drainage_index float Proxy index 0–1 (lower values indicate poorer drainage).
ndvi float Normalized Difference Vegetation Index (−1 to 1).
ndwi float Normalized Difference Water Index (−1 to 1).
water_presence_flag string Likely / unlikely surface water presence.
historical_flood_count integer Number of historical flood events (synthetic).
infrastructure_score integer Composite infrastructure score (0–100).
nearest_hospital_km float Distance to nearest hospital (km).
nearest_evac_km float Distance to nearest evacuation point (km).
flood_risk_score float Derived flood risk score (0–100) combining features.
flood_occurrence_current_event string Yes/No flag for the simulated current flood event.
inundation_area_sqm integer Estimated inundation area in square metres (0 if not flooded).
is_good_to_live string Yes/No recommendation based on risk and infrastructure.
reason_not_good_to_live string Explanation when is_good_to_live == No. 
is_synthetic boolean Indicator that the dataset is synthetic.
generation_date date Generation timestamp (YYYY-MM-DD).

Note: The dataset can be accessed at the following URL: https://www.kaggle.com/datasets/dewminimnaadi/sri-lanka-flood-risk-and-inundation-dataset.

1.3 Goal

As established in the preceding sections, the objective of this study is to examine how environmental, geographical, and socio-demographic factors influence the degree of flood exposure across Sri Lanka. To this end, the analysis centres on predicting the continuous variable flood_risk_score, an index ranging from 0 to 100 that encapsulates the estimated flood risk at each geographic location.

This score is derived from multiple underlying attributes, including recent rainfall patterns, terrain elevation, proximity to rivers, vegetation indices, population density, and infrastructure characteristics. By modelling this variable, we aim to identify the most influential predictors of flood risk and to quantify their association with the observed spatial variability.

1.4 Data Import and Train–Test Split for Linear Regression

To prepare the dataset for the modelling stage, we first load the cleaned CSV file and then create an 80/20 train–test split.

For that aim, we rely on the initial_split() function from the tidymodels package and apply stratification on the response variable (flood_risk_score) to ensure that both partitions preserve a similar distribution of the target.

# First, we set a seed to ensure reproducibility
set.seed(123)

# Loading the data
srilanka <- read_csv("sri_lanka_flood_risk_dataset_25000.csv")

# Obtaining the stratified 80/20 train–test split
data_split <- initial_split(
  srilanka,
  prop   = 0.8,
  strata = flood_risk_score
)

# Obtaining Train/Test Data
TrainData <- training(data_split)
TestData  <- testing(data_split)

# We can perform a balance verification

tibble(
  Set = c("Train", "Test"),
  n = c(nrow(TrainData), nrow(TestData)),
  mean_risk = c(mean(TrainData$flood_risk_score),
                mean(TestData$flood_risk_score)),
  sd_risk = c(sd(TrainData$flood_risk_score),
              sd(TestData$flood_risk_score))
)
## # A tibble: 2 × 4
##   Set       n mean_risk sd_risk
##   <chr> <int>     <dbl>   <dbl>
## 1 Train 19998      33.3    14.8
## 2 Test   5002      33.2    14.7
# Showing a quick overview of the training set
glimpse(TrainData)
## Rows: 19,998
## Columns: 32
## $ record_id                      <chr> "F100003", "F100005", "F100012", "F1000…
## $ district                       <chr> "Puttalam", "Galle", "Mannar", "Matara"…
## $ place_name                     <chr> "Mahagama East", "Udupola", "Galakanda"…
## $ latitude                       <dbl> 5.922365, 6.394456, 9.852201, 6.952797,…
## $ longitude                      <dbl> 81.48479, 80.45167, 80.81168, 80.66979,…
## $ elevation_m                    <dbl> 111, 34, 112, 153, 1473, 84, 53, 762, 7…
## $ distance_to_river_m            <dbl> 2950.4, 2086.5, 631.7, 632.0, 880.1, 14…
## $ landcover                      <chr> "Urban", "Forest", "Urban", "Forest", "…
## $ soil_type                      <chr> "Sandy", "Silty", "Clay", "Sandy", "Loa…
## $ water_supply                   <chr> "Surface water", "Well", "Rainwater har…
## $ electricity                    <chr> "Grid", "Grid", "Off-grid (solar)", "Gr…
## $ road_quality                   <chr> "Good (paved)", "Good (paved)", "Fair",…
## $ population_density_per_km2     <dbl> 729, 1730, 261, 10, 1164, 543, 1191, 27…
## $ built_up_percent               <dbl> 16.3, 27.6, 5.7, 17.8, 1.0, 2.5, 11.1, …
## $ urban_rural                    <chr> "Urban", "Urban", "Urban", "Rural", "Ur…
## $ rainfall_7d_mm                 <dbl> 10.1, 19.5, 50.8, 111.9, 32.8, 66.4, 58…
## $ monthly_rainfall_mm            <dbl> 171.6, 93.4, 8.0, 2.9, 139.4, 170.1, 10…
## $ drainage_index                 <dbl> 0.288, 0.870, 0.376, 0.557, 0.251, 0.62…
## $ ndvi                           <dbl> 0.439, 0.666, -0.303, 0.732, 0.109, -0.…
## $ ndwi                           <dbl> -0.259, -0.212, -0.315, 0.343, -0.009, …
## $ water_presence_flag            <chr> "Unlikely", "Unlikely", "Unlikely", "Li…
## $ historical_flood_count         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ infrastructure_score           <dbl> 58, 45, 56, 27, 41, 76, 36, 30, 43, 33,…
## $ nearest_hospital_km            <dbl> 11.44, 9.97, 10.71, 14.14, 6.06, 12.45,…
## $ nearest_evac_km                <dbl> 4.94, 0.96, 4.85, 5.05, 7.65, 4.14, 4.3…
## $ flood_risk_score               <dbl> 17.44, 12.71, 6.67, 14.05, 18.34, 17.09…
## $ flood_occurrence_current_event <chr> "No", "No", "No", "No", "No", "No", "No…
## $ inundation_area_sqm            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_good_to_live                <chr> "Yes", "Yes", "Yes", "No", "Yes", "Yes"…
## $ reason_not_good_to_live        <chr> NA, NA, NA, "Poor infrastructure", NA, …
## $ is_synthetic                   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ generation_date                <date> 2024-11-06, 2025-09-19, 2025-04-06, 20…

1.5 Descriptive and Explanatory Analysis

1.5.1 Summary statistics and variable types

In order to understand the structure of the data before fitting any model, we first compute basic summary statistics for all variables in the training set. This allows us to obtain an overview of typical values, ranges and potential anomalies for the main features used in the analysis.

In parallel, we inspect the data types of each column to distinguish between continuous variables, categorical factors and boolean indicators, which will determine how these variables are treated in the subsequent linear regression models.

# Basic summary statistics for the training data
summary(TrainData)
##   record_id           district          place_name           latitude    
##  Length:19998       Length:19998       Length:19998       Min.   :5.900  
##  Class :character   Class :character   Class :character   1st Qu.:6.914  
##  Mode  :character   Mode  :character   Mode  :character   Median :7.927  
##                                                           Mean   :7.922  
##                                                           3rd Qu.:8.927  
##                                                           Max.   :9.950  
##    longitude      elevation_m   distance_to_river_m  landcover        
##  Min.   :79.65   Min.   :   0   Min.   :    0.0     Length:19998      
##  1st Qu.:80.21   1st Qu.:  31   1st Qu.:  583.7     Class :character  
##  Median :80.76   Median :  68   Median : 1407.6     Mode  :character  
##  Mean   :80.77   Mean   : 187   Mean   : 2014.5                       
##  3rd Qu.:81.32   3rd Qu.: 124   3rd Qu.: 2785.7                       
##  Max.   :81.90   Max.   :2148   Max.   :16702.9                       
##   soil_type         water_supply       electricity        road_quality      
##  Length:19998       Length:19998       Length:19998       Length:19998      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  population_density_per_km2 built_up_percent urban_rural       
##  Min.   :  10.0             Min.   : 1.00    Length:19998      
##  1st Qu.: 290.0             1st Qu.: 9.80    Class :character  
##  Median : 521.5             Median :23.30    Mode  :character  
##  Mean   : 627.5             Mean   :25.17                      
##  3rd Qu.: 798.0             3rd Qu.:37.10                      
##  Max.   :3113.0             Max.   :95.00                      
##  rainfall_7d_mm   monthly_rainfall_mm drainage_index        ndvi       
##  Min.   :  0.00   Min.   :  0.0       Min.   :0.0030   Min.   :-0.791  
##  1st Qu.: 33.10   1st Qu.:106.7       1st Qu.:0.3310   1st Qu.:-0.019  
##  Median : 69.50   Median :202.0       Median :0.5010   Median : 0.167  
##  Mean   : 80.44   Mean   :216.1       Mean   :0.5009   Mean   : 0.169  
##  3rd Qu.:116.80   3rd Qu.:306.8       3rd Qu.:0.6730   3rd Qu.: 0.353  
##  Max.   :372.20   Max.   :863.7       Max.   :0.9990   Max.   : 1.000  
##       ndwi          water_presence_flag historical_flood_count
##  Min.   :-1.00000   Length:19998        Min.   :0.0000        
##  1st Qu.:-0.14900   Class :character    1st Qu.:0.0000        
##  Median : 0.02600   Mode  :character    Median :0.0000        
##  Mean   : 0.03081                       Mean   :0.2029        
##  3rd Qu.: 0.20600                       3rd Qu.:0.0000        
##  Max.   : 1.00000                       Max.   :5.0000        
##  infrastructure_score nearest_hospital_km nearest_evac_km  flood_risk_score
##  Min.   : 5.00        Min.   : 0.000      Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:32.00        1st Qu.: 2.250      1st Qu.: 1.730   1st Qu.: 22.54  
##  Median :46.00        Median : 5.520      Median : 4.160   Median : 32.30  
##  Mean   :46.49        Mean   : 7.966      Mean   : 6.021   Mean   : 33.27  
##  3rd Qu.:60.00        3rd Qu.:11.080      3rd Qu.: 8.360   3rd Qu.: 42.94  
##  Max.   :95.00        Max.   :75.710      Max.   :55.950   Max.   :100.00  
##  flood_occurrence_current_event inundation_area_sqm is_good_to_live   
##  Length:19998                   Min.   :     0      Length:19998      
##  Class :character               1st Qu.:     0      Class :character  
##  Mode  :character               Median :     0      Mode  :character  
##                                 Mean   : 12381                        
##                                 3rd Qu.:     0                        
##                                 Max.   :309837                        
##  reason_not_good_to_live is_synthetic   generation_date     
##  Length:19998            Mode:logical   Min.   :2023-12-10  
##  Class :character        TRUE:19998     1st Qu.:2024-06-08  
##  Mode  :character                       Median :2024-12-10  
##                                         Mean   :2024-12-08  
##                                         3rd Qu.:2025-06-09  
##                                         Max.   :2025-12-08
# Variable type table
var_types <- tibble(
  Variable = names(TrainData),
  Type = sapply(TrainData, function(x) class(x)[1])
)

kable(var_types,
      caption = "Variable classes in the training dataset",
      align = "ll",
      row.names = FALSE) |>
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover"))
Variable classes in the training dataset
Variable Type
record_id character
district character
place_name character
latitude numeric
longitude numeric
elevation_m numeric
distance_to_river_m numeric
landcover character
soil_type character
water_supply character
electricity character
road_quality character
population_density_per_km2 numeric
built_up_percent numeric
urban_rural character
rainfall_7d_mm numeric
monthly_rainfall_mm numeric
drainage_index numeric
ndvi numeric
ndwi numeric
water_presence_flag character
historical_flood_count numeric
infrastructure_score numeric
nearest_hospital_km numeric
nearest_evac_km numeric
flood_risk_score numeric
flood_occurrence_current_event character
inundation_area_sqm numeric
is_good_to_live character
reason_not_good_to_live character
is_synthetic logical
generation_date Date

Overall, the training set consists of 19,998 observations and 32 variables. We can extract the following conclusions:

  • The summary statistics reveal that the main geographical coordinates latitude and longitude span relatively narrow ranges, consistently with the spatial extent of Sri Lanka. Terrain elevation elevation_m varies from low-lying areas close to sea level up to more than 2,100 metres, while distance_to_river_m covers a wide interval from locations directly adjacent to rivers to cells more than 16 km away.

  • Environmental indicators display substantial variability: recent rainfall rainfall_7d_mm ranges from 0 to over 370 mm, monthly rainfall monthly_rainfall_mm reaches almost 870 mm, and the indices ndvi and ndwi cover most of their theoretical support [-1,1], suggesting heterogeneous vegetation and surface-water conditions across the study area.

  • Socio-demographic and exposure variables also show marked dispersion. For example, population_density_per_km2 ranges from sparsely populated cells with only 10 inhabitants per km² to highly dense locations exceeding 3,100 inhabitants per km², and built_up_percent spans from almost non-developed areas to cells with 95% built-up surface.

  • The variable inundation_area_sqm is highly skewed, with many zeros and a few very large values above 300,000 m², which anticipates the need for special treatment if this variable is later analysed.

  • The response variable flood_risk_score takes values between 0 and 100, with a median around 32 and an upper quartile close to 43, indicating that most locations exhibit low to moderate estimated risk, while high-risk cells are comparatively less frequent.

The variable–type table clarifies the structure of the dataset.

  • Most predictors of interest are continuous numeric variables, which are naturally suited for inclusion in linear regression models (e.g. elevation_m, distance_to_river_m, rainfall_7d_mm, monthly_rainfall_mm, population_density_per_km2, historical_flood_count, infrastructure_score, etc.).

  • A second group comprises categorical variables stored as character strings, such as district, landcover, soil_type, water_supply,electricity,road_quality,etc. These must be converted to factors before modelling so that the regression coefficients represent differences between categories.

  • Finally, is_synthetic (logical) and generation_date (date) play a descriptive or metadata role and will not be used as predictors in the subsequent linear regression analysis.

1.5.2 Distribution of the response variable flood_risk_score

Before exploring relationships with the predictors, it is useful to examine the marginal distribution of the response variable flood_risk_score in the training set. This helps to identify potential skewness, multimodality or extreme values that may influence the behaviour of linear regression models and the interpretation of their residuals.

ggplot(TrainData, aes(x = flood_risk_score)) +
  geom_density(fill = "skyblue", alpha = 0.6, linewidth = 1) +
  labs(
    title = "Distribution of the flood risk score",
    x = "Flood risk score",
    y = "Density"
  ) +
  theme_minimal(base_size = 13)

The distribution of flood_risk_score in the training set exhibits a unimodal and approximately bell-shaped pattern, with most observations concentrated between 15 and 45. The density curve peaks around the upper 20s to lower 30s, which is consistent with the median of 32.3 reported in the summary statistics. The right tail extends gradually toward higher scores, indicating the presence of a smaller proportion of locations classified as high-risk (scores above 60). Conversely, very low values close to zero are relatively uncommon, suggesting that few areas are assigned negligible flood risk. Overall, the distribution is only mildly right-skewed and does not display sharp discontinuities or multimodality, making flood_risk_score suitable as a continuous response variable for linear modelling.

Given the approximately symmetric and well-behaved distribution of flood_risk_score, no transformation of the response variable is required prior to fitting linear regression models. This decision may be revisited after inspecting model residuals, but the current distribution does not indicate the need for a logarithmic or Box–Cox transformation.

1.5.3 Univariate exploration of numerical predictors

To explore the marginal behaviour of the most informative numerical predictors, we begin by analysing the distributions of three continuous variables that have a strong theoretical relationship with flood exposure: elevation_m, distance_to_river_m and rainfall_7d_mm. These predictors cover complementary physical dimensions relevant to flood processes:

  • elevation_m reflects terrain height, which is a primary determinant of water accumulation. Lower elevations generally correspond to a higher likelihood of flooding.

  • distance_to_river_m captures proximity to hydrological networks. Locations closer to rivers tend to exhibit greater flood susceptibility due to overflow and channel expansion during heavy rainfall.

  • rainfall_7d_mm represents short-term precipitation intensity. High recent rainfall levels increase soil saturation and runoff, amplifying flood risk.

By examining these variables individually, we gain insight into their scale, variability and potential skewness, which helps anticipate how they may influence the performance and assumptions of linear regression models.

p1 <- ggplot(TrainData, aes(x = elevation_m)) +
  geom_density(fill = "skyblue", alpha = 0.6, linewidth = 1) +
  labs(title = "Distribution of elevation",
       x = "Elevation (m)", y = "Density") +
  theme_minimal(base_size = 13)

p2 <- ggplot(TrainData, aes(x = distance_to_river_m)) +
  geom_density(fill = "skyblue", alpha = 0.6, linewidth = 1) +
  labs(title = "Distribution of distance to nearest river",
       x = "Distance (m)", y = "Density") +
  theme_minimal(base_size = 13)

p3 <- ggplot(TrainData, aes(x = rainfall_7d_mm)) +
  geom_density(fill = "skyblue", alpha = 0.6, linewidth = 1) +
  labs(title = "Distribution of rainfall over the last 7 days",
       x = "Rainfall (mm)", y = "Density") +
  theme_minimal(base_size = 13)

p1; p2; p3

The univariate distributions of the selected continuous predictors reveal markedly different patterns that reflect the underlying physical processes related to flood exposure.

  • The variable elevation_m is highly right-skewed, with the vast majority of locations concentrated at low altitudes, which is consistent with the geographical profile of Sri Lanka.
  • In contrast, distance_to_river_m decreases sharply near zero, indicating that many spatial units lie close to river channels, a factor that may substantially influence flood susceptibility.
  • Finally, rainfall_7d_mm exhibits a moderately right-skewed distribution, with most observations falling below 150 mm but a non-negligible tail representing intense rainfall events.

1.5.4 Univariate exploration of categorical predictors

To complement the analysis of continuous predictors, we now examine the marginal behaviour of three categorical variables that capture relevant land-use and socio-geographical characteristics: landcover, urban_rural and soil_type. These variables describe, respectively, the dominant surface cover of each location, whether the area is classified as urban or rural, and the prevailing soil category.

All three factors may influence flood dynamics through their effects on infiltration capacity, runoff generation and human exposure. For example, impermeable urban surfaces or certain soil types can reduce infiltration and increase surface runoff, whereas vegetated or permeable areas may attenuate flood peaks. Understanding the distribution of these categories provides useful context for interpreting their potential contribution to the variability of flood_risk_score.

p4 <- ggplot(TrainData, aes(x = landcover)) +
  geom_bar(fill = "skyblue", alpha = 0.7) +
  labs(
    title = "Distribution of land-cover categories",
    x = "Land-cover type",
    y = "Count"
  ) +
  theme_minimal(base_size = 13) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

p5 <- ggplot(TrainData, aes(x = urban_rural)) +
  geom_bar(fill = "skyblue", alpha = 0.7) +
  labs(
    title = "Distribution of urban vs. rural locations",
    x = "Category",
    y = "Count"
  ) +
  theme_minimal(base_size = 13)

p6 <- ggplot(TrainData, aes(x = soil_type)) +
  geom_bar(fill = "skyblue", alpha = 0.7) +
  labs(
    title = "Distribution of soil_type categories",
    x = "Soil type",
    y = "Count"
  ) +
  theme_minimal(base_size = 13)

p4; p5; p6

1.5.5 Selection of predictors for the regression analysis

Before fitting the linear regression models, it is necessary to identify which variables are appropriate to use as predictors of flood_risk_score. Several attributes in the dataset are unsuitable due to conceptual, statistical, or methodological considerations. Below, we explain the rationale for excluding specific variables.

  • generation_date: this is a metadata field describing when each synthetic observation was generated. It contains no geographical or environmental information relevant to flood risk.
  • is_synthetic: represents a constant attribute equal to TRUE for all observations. As it lacks variability, it cannot contribute to model estimation.
  • is_good_to_live and reason_not_good_to_live: these variables represent derived recommendations partly based on flood risk itself. Including them would introduce circularity and bias, as they indirectly encode the target variable.
  • flood_occurrence_current_event: reflects participation in a simulated current flood event. As it depends on extreme-event conditions rather than long-term risk, including it would distort the modelling of flood_risk_score.
  • inundation_area_sqm: represents the simulated inundated area for the current event. Its strong dependence on the same mechanisms that generate the risk score makes it unsuitable due to potential data leakage.
  • longitude and latitude: although spatial position is relevant for flood modelling, raw coordinates introduce a highly non-linear spatial structure that is poorly captured by simple linear regression. These variables also risk acting as proxies for unobserved processes, causing unstable and difficult-to-interpret coefficients.
  • district: contains many categories and would require generating a large set of dummy variables. Such encoding increases model complexity, introduces sparse estimations, and may overshadow the effect of meaningful environmental variables.

Moreover, it is important to verify whether any missing values are present among the selected predictors or in the response variable. Detecting and addressing missing data at this stage ensures the integrity of the regression analysis and prevents potential biases or inconsistencies in the model estimates.

# List of selected continuous predictors
vars <- c(
  "elevation_m",
  "distance_to_river_m",
  "rainfall_7d_mm",
  "monthly_rainfall_mm",
  "drainage_index",
  "ndvi",
  "ndwi",
  "population_density_per_km2",
  "built_up_percent",
  "infrastructure_score",
  "nearest_hospital_km",
  "nearest_evac_km",
  "historical_flood_count"
)

# Add the response variable to the list
vars_full <- c(vars, "flood_risk_score")

# Subset the dataset to only the variables of interest
subset_df <- TrainData[, vars_full]

# Compute missing values for each selected variable
missing_counts <- sapply(subset_df, function(x) sum(is.na(x)))

# Display the results as a data frame for readability
missing_table <- data.frame(
  Variable = names(missing_counts),
  Missing_Values = missing_counts
)

missing_table
##                                              Variable Missing_Values
## elevation_m                               elevation_m              0
## distance_to_river_m               distance_to_river_m              0
## rainfall_7d_mm                         rainfall_7d_mm              0
## monthly_rainfall_mm               monthly_rainfall_mm              0
## drainage_index                         drainage_index              0
## ndvi                                             ndvi              0
## ndwi                                             ndwi              0
## population_density_per_km2 population_density_per_km2              0
## built_up_percent                     built_up_percent              0
## infrastructure_score             infrastructure_score              0
## nearest_hospital_km               nearest_hospital_km              0
## nearest_evac_km                       nearest_evac_km              0
## historical_flood_count         historical_flood_count              0
## flood_risk_score                     flood_risk_score              0

1.5.6 Bivariate exploration of predictors and the response variable

To complement the univariate analysis, we now examine how each predictor relates to the response variable flood_risk_score. This bivariate exploration is essential for identifying the strength, direction, and linearity of potential associations, and for anticipating modelling challenges such as heteroscedasticity or non-linear patterns.

We organise the analysis into two parts:

  • Continuous predictors: scatterplots with linear smoothing lines allow us to visually assess whether variables such as elevation_m, distance_to_river_m, rainfall_7d_mm, etc. display monotonic trends with flood_risk_score.

  • Categorical predictors: predictors such as landcover, soil_type, urban_rural, water_supply, etc. are explored using boxplots and distribution comparisons, which reveal whether different categories correspond to systematically higher or lower flood risk.

We begin with environmental predictors that directly influence flood generation mechanisms.

# Custom theme
custom_theme <- theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", colour = "#003366"),
    axis.title = element_text(colour = "grey20"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(colour = "grey85")
  )

# Lighter points, darker line
pt_col     <- "#6baed6"  # light blue for points
smooth_col <- "#08306b"  # very dark blue for line

p1 <- ggplot(TrainData, aes(x = elevation_m, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
  labs(
    title = "Elevation vs. flood risk",
    x = "Elevation (m)",
    y = "Flood risk score"
  ) +
  custom_theme

p2 <- ggplot(TrainData, aes(x = distance_to_river_m, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
  labs(
    title = "Distance to river vs. flood risk",
    x = "Distance to nearest river (m)",
    y = "Flood risk score"
  ) +
  custom_theme

p3 <- ggplot(TrainData, aes(x = drainage_index, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
  labs(
    title = "Drainage index vs. flood risk",
    x = "Drainage index (0–1)",
    y = "Flood risk score"
  ) +
  custom_theme

(p1 | p2) /
(p3 | patchwork::plot_spacer())

Rainfall intensity and vegetation indices reflect hydrometeorological and land-surface conditions that strongly modulate the buildup, absorption and runoff of water. Therefore, examining their relationships with flood_risk_score helps reveal how recent precipitation and surface characteristics contribute to flood exposure patterns.

# Reusing the custom theme and colors defined before
custom_theme <- theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", colour = "#003366"),
    axis.title = element_text(colour = "grey20"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(colour = "grey85")
  )

pt_col     <- "#6baed6"  # light blue points
smooth_col <- "#08306b"  # dark blue regression line

# Rainfall 7 days
p4 <- ggplot(TrainData, aes(x = rainfall_7d_mm, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
  labs(
    title = "Rainfall (7d) vs. flood risk",
    x = "Rainfall in the last 7 days (mm)",
    y = "Flood risk score"
  ) +
  custom_theme

# Monthly Rainfall
p5 <- ggplot(TrainData, aes(x = monthly_rainfall_mm, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
  labs(
    title = "Monthly rainfall vs. flood risk",
    x = "Monthly rainfall (mm)",
    y = "Flood risk score"
  ) +
  custom_theme

# NDVI
p6 <- ggplot(TrainData, aes(x = ndvi, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
  labs(
    title = "NDVI vs. flood risk",
    x = "NDVI (−1 to 1)",
    y = "Flood risk score"
  ) +
  custom_theme

# NDWI
p7 <- ggplot(TrainData, aes(x = ndwi, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
  labs(
    title = "NDWI vs. flood risk",
    x = "NDWI (−1 to 1)",
    y = "Flood risk score"
  ) +
  custom_theme

# Combine into a 2×2 panel
(p4 | p5) /
(p6 | p7)

The following panel displays scatterplots for demographic intensity, built-up area, accessibility to emergency services and historical flood counts against the response variable, providing insight into how these socio-environmental factors relate to estimated flood risk.

# Reusing the custom theme and colors defined before
custom_theme <- theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", colour = "#003366"),
    axis.title = element_text(colour = "grey20"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(colour = "grey85")
  )

pt_col     <- "#6baed6"  # light blue points
smooth_col <- "#08306b"  # dark blue regression line

# Population density
p8 <- ggplot(TrainData, aes(x = population_density_per_km2, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
  labs(
    title = "Population density",
    x = "People per km²",
    y = "Flood risk score"
  ) +
  custom_theme

# Built-up percent
p9 <- ggplot(TrainData, aes(x = built_up_percent, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
  labs(
    title = "Built-up area",
    x = "Built-up area (%)",
    y = "Flood risk score"
  ) +
  custom_theme

# Infrastructure score
p10 <- ggplot(TrainData, aes(x = infrastructure_score, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
  labs(
    title = "Infrastructure score",
    x = "Score (0–100)",
    y = "Flood risk score"
  ) +
  custom_theme

# Distance to nearest hospital
p11 <- ggplot(TrainData, aes(x = nearest_hospital_km, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
  labs(
    title = "Distance to hospital",
    x = "km",
    y = "Flood risk score"
  ) +
  custom_theme

# Distance to nearest evacuation point
p12 <- ggplot(TrainData, aes(x = nearest_evac_km, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
  labs(
    title = "Distance to evacuation pt",
    x = "km",
    y = "Flood risk score"
  ) +
  custom_theme

# Historical flood count
p13 <- ggplot(TrainData, aes(x = historical_flood_count, y = flood_risk_score)) +
  geom_point(alpha = 0.20, colour = pt_col) +
  geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
  labs(
    title = "Historical flood count",
    x = "Number of past floods",
    y = "Flood risk score"
  ) +
  custom_theme

# 3×2 layout
(p8  | p9  | p10) /
(p11 | p12 | p13)

Categorical predictors describe qualitative characteristics of each location, such as land use, soil properties or access to basic services. To investigate how these factors relate to flood_risk_score, we compare the distribution of the response across categories of landcover, soil_type, urban_rural, water_supply, electricity and road_quality using boxplots. These visualisations indicate whether certain classes are systematically associated with higher or lower estimated flood risk.

custom_theme <- theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(face = "bold", colour = "#003366", size = 11),
    axis.title = element_text(colour = "grey20"),
    axis.text.x = element_text(angle = 25, hjust = 1),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(colour = "grey85")
  )

box_col <- "#6baed6"

p_cat1 <- ggplot(TrainData, aes(x = landcover, y = flood_risk_score)) +
  geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
  labs(title = "Landcover", x = "Land-cover type", y = "Flood risk score") +
  custom_theme

p_cat2 <- ggplot(TrainData, aes(x = soil_type, y = flood_risk_score)) +
  geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
  labs(title = "Soil type", x = "Soil type", y = "Flood risk score") +
  custom_theme

p_cat3 <- ggplot(TrainData, aes(x = urban_rural, y = flood_risk_score)) +
  geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
  labs(title = "Urban vs rural", x = "Category", y = "Flood risk score") +
  custom_theme

p_cat4 <- ggplot(TrainData, aes(x = water_supply, y = flood_risk_score)) +
  geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
  labs(title = "Water supply", x = "Water supply source", y = "Flood risk score") +
  custom_theme

p_cat5 <- ggplot(TrainData, aes(x = electricity, y = flood_risk_score)) +
  geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
  labs(title = "Electricity", x = "Electricity availability", y = "Flood risk score") +
  custom_theme

p_cat6 <- ggplot(TrainData, aes(x = road_quality, y = flood_risk_score)) +
  geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
  labs(title = "Road quality", x = "Road quality", y = "Flood risk score") +
  custom_theme

(p_cat1 | p_cat2 | p_cat3) /
(p_cat4 | p_cat5 | p_cat6)

The exploratory analysis reveals a diverse set of relationships between the predictors and the response variable flood_risk_score. The main findings can be summarised as follows:

  • Geographical predictors:

    • Both elevation_m and distance_to_river_m display weak but interpretable negative associations with flood risk. Lower elevations and locations closer to rivers tend to exhibit higher flood exposure, consistent with hydrological expectations.
  • Meteorological variables: rainfall_7d_mm and monthly_rainfall_mm show clear positive linear trends with the response, indicating that recent and accumulated precipitation substantially increases estimated flood risk.

  • Vegetation indices ndvi and ndwi exhibit negligible linear relationships, suggesting that surface-cover characteristics exert only modest influence in the synthetic risk model.

  • Socio-environmental predictors:

    • built_up_percent and population_density_per_km2 show very weak positive associations with flood risk.
    • infrastructure_score shows a slight negative association, consistent with better infrastructure contributing to reduced vulnerability.
    • Accessibility measures such as nearest_hospital_km and nearest_evac_km show almost no linear pattern.
    • historical_flood_count stands out with a clearly increasing trend, making it one of the strongest predictors in the dataset.
  • Categorical predictors:

    • Certain land-use and socio-geographical categories, such as wetlands and urban areas, tend to present higher flood risk levels.
    • Other categories, such as forested areas or soil types associated with better drainage, show generally lower flood risk.
    • However, the substantial overlap across most categories indicates that many qualitative predictors exert only modest marginal influence.

1.5.7 Multicollinearity and correlation analysis

Before fitting linear regression models, it is essential to evaluate whether strong linear relationships exist among the selected continuous predictors. High multicollinearity can inflate coefficient variances, reduce model interpretability, and weaken statistical inference.

We restrict the analysis to the continuous predictors selected in the previous subsection:

  • elevation_m
  • distance_to_river_m
  • rainfall_7d_mm
  • monthly_rainfall_mm
  • drainage_index
  • ndvi
  • ndwi
  • population_density_per_km2
  • built_up_percent
  • infrastructure_score
  • nearest_hospital_km
  • nearest_evac_km
  • historical_flood_count
# Step 1: We define the list of continuous predictor names
vars <- c(
  "elevation_m",
  "distance_to_river_m",
  "rainfall_7d_mm",
  "monthly_rainfall_mm",
  "drainage_index",
  "ndvi",
  "ndwi",
  "population_density_per_km2",
  "built_up_percent",
  "infrastructure_score",
  "nearest_hospital_km",
  "nearest_evac_km",
  "historical_flood_count"
)

# Step 2: We select these variables from the training dataset
cont_vars <- TrainData |>
  dplyr::select(dplyr::all_of(vars))

# Step 3: We compute the Pearson correlation matrix
cor_matrix <- cor(
  cont_vars,
  use   = "pairwise.complete.obs",
  method = "pearson"
)

# Print the numeric matrix rounded to 2 decimals to inspect it
# round(cor_matrix, 2)

# Step 4: We reshape the matrix to a long format suitable for ggplot2
cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Var1", "Var2", "Correlation")

# Step 5: We create a custom heatmap with ggplot
ggplot(cor_df, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.2f", Correlation)), size = 3) +
  scale_fill_gradient2(
    low  = "#c6dbef",
    mid  = "white",
    high = "#08306b",
    midpoint = 0,
    limits = c(-1, 1)
  ) +
  coord_equal() +
  labs(
    title = "Correlation matrix of continuous predictors",
    x = NULL,
    y = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title  = element_text(face = "bold", size = 14, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid  = element_blank()
  )

The correlation matrix of the 13 continuous predictors shows that overall multicollinearity is extremely low:

  • Most pairwise correlations are close to 0, indicating that the predictors capture largely distinct aspects of the physical and socio-environmental system.
  • The highest correlations are still very small (around 0.07 or −0.16), far below any level considered problematic.
  • In hydrological and environmental datasets, variables such as rainfall_7d_mm, monthly_rainfall_mm, distance_to_river_mand elevation_m could theoretically show moderate correlations, but in this synthetic dataset they are almost independent.
  • Historical variables as historical_flood_count are also weakly related to geographic or meteorological predictors, reflecting the dataset’s construction.
  • We can conclude that all the selected predictors are safe to include simultaneously in linear regression models without risk of unstable coefficient estimates.

1.6 Linear Regression Modelling

1.6.1 Quantifying associations between predictors and the response variable

Before specifying the linear regression models, and completing the work performed in the previous section, it is useful to quantify the strength of the association between each predictor and the response variable flood_risk_score.

Evaluating these associations provides an objective foundation for selecting predictors, identifying irrelevant variables, and anticipating potential modelling challenges.

Given that the dataset contains both continuous and categorical predictors, we evaluate their relationships with the response using appropriate measures:

  • For continuous predictors, we compute Pearson correlation coefficients, which capture the direction and magnitude of linear associations.
  • For categorical predictors, where correlations are not defined, we use one-way ANOVA and report eta-squared (η²) as a measure of effect size, quantifying the proportion of variance in flood_risk_score explained by category membership.
  • For binary categorical variables, we additionally inspect group mean differences to provide an intuitive comparison between categories.
# List of continuous predictors
cont_predictors <- c(
  "elevation_m", "distance_to_river_m", "rainfall_7d_mm", "monthly_rainfall_mm",
  "drainage_index", "ndvi", "ndwi", "population_density_per_km2",
  "built_up_percent", "infrastructure_score",
  "nearest_hospital_km", "nearest_evac_km", "historical_flood_count"
)

# List of categorical predictors
cat_predictors <- c("landcover", "soil_type", "urban_rural",
                    "water_supply", "electricity", "road_quality")

# Continuous predictors: correlations
cont_assoc <- cont_predictors |>
  lapply(function(v) {
    cor_val <- cor(TrainData[[v]], TrainData$flood_risk_score,
                   use = "pairwise.complete.obs", method = "pearson")
    tibble(Predictor = v, Type = "Continuous", Association = cor_val)
  }) |>
  bind_rows()

# Categorical predictors: ANOVA effect size (Eta-squared)
cat_assoc <- cat_predictors |>
  lapply(function(v) {
    # I build the formula flood_risk_score ~ <predictor>
    fmla <- as.formula(paste("flood_risk_score ~", v))
    model <- aov(fmla, data = TrainData)
    eta_sq <- etaSquared(model, anova = TRUE)[1, "eta.sq"]
    tibble(Predictor = v, Type = "Categorical", Association = eta_sq)
  }) |>
  bind_rows()

# Combining and displaying the results
summary_table <- bind_rows(cont_assoc, cat_assoc)

kable(
  summary_table,
  caption = "Association between predictors and flood risk score",
  digits = 3
)
Association between predictors and flood risk score
Predictor Type Association
elevation_m Continuous -0.112
distance_to_river_m Continuous -0.147
rainfall_7d_mm Continuous 0.409
monthly_rainfall_mm Continuous 0.769
drainage_index Continuous -0.146
ndvi Continuous -0.013
ndwi Continuous 0.002
population_density_per_km2 Continuous 0.007
built_up_percent Continuous 0.060
infrastructure_score Continuous -0.142
nearest_hospital_km Continuous -0.002
nearest_evac_km Continuous 0.002
historical_flood_count Continuous 0.229
landcover Categorical 0.000
soil_type Categorical 0.000
urban_rural Categorical 0.000
water_supply Categorical 0.000
electricity Categorical 0.001
road_quality Categorical 0.003

For categorical variables, the η² values are close to zero, indicating that group membership (e.g., land-cover class, soil type, urban/rural status, access to services) explains very little variance in the response variable when considered individually. This suggests that their influence is limited in a marginal (univariate) sense and may become relevant only when combined with other predictors in a multivariate modelling context.

1.6.2 Baseline simple linear regression model

Given the correlation analysis conducted in the previous subsection, we now refine our focus to those continuous predictors that exhibited the strongest marginal associations with the response variable flood_risk_score.

In particular, monthly_rainfall_mm, rainfall_7d_mm, historical_flood_count,distance_to_river_m, drainage_index, infrastructure_score and elevation_m demonstrated the largest absolute correlations, indicating a potentially meaningful linear relationship with the flood risk score when considered individually.

To further quantify the explanatory power of these predictors, we fit a series of simple linear regression models of the form

\[ flood\_risk\_score = \beta_0 + \beta_1 X + \varepsilon , \]

where \(X\) denotes each selected predictor in isolation.

This step serves two purposes:

  • Provide an interpretable measure of the individual effect of each predictor on the response, free from confounding by other variables.

  • Establish a baseline for comparison before transitioning to multivariate linear regression models that incorporate multiple predictors simultaneously.

By analysing the estimated coefficients, statistical significance and \(R^2\) values of these simple models, we obtain a clearer picture of which environmental and socio-hydrological factors have the strongest direct influence on flood risk in a univariate setting.

# Simple linear regression: flood_risk_score ~ monthly_rainfall_mm
lm_monthly <- lm(flood_risk_score ~ monthly_rainfall_mm, data = TrainData)
summary(lm_monthly)
## 
## Call:
## lm(formula = flood_risk_score ~ monthly_rainfall_mm, data = TrainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.211  -7.214  -0.934   5.997  39.308 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.536e+01  1.248e-01   123.1   <2e-16 ***
## monthly_rainfall_mm 8.287e-02  4.873e-04   170.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.47 on 19996 degrees of freedom
## Multiple R-squared:  0.5912, Adjusted R-squared:  0.5912 
## F-statistic: 2.892e+04 on 1 and 19996 DF,  p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ rainfall_7d_mm
lm_rain7d <- lm(flood_risk_score ~ rainfall_7d_mm, data = TrainData)
summary(lm_rain7d)
## 
## Call:
## lm(formula = flood_risk_score ~ rainfall_7d_mm, data = TrainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.875  -9.942  -0.985   8.752  63.059 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    24.954723   0.162528   153.5   <2e-16 ***
## rainfall_7d_mm  0.103423   0.001634    63.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.52 on 19996 degrees of freedom
## Multiple R-squared:  0.1669, Adjusted R-squared:  0.1669 
## F-statistic:  4006 on 1 and 19996 DF,  p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ historical_flood_count
lm_hist <- lm(flood_risk_score ~ historical_flood_count, data = TrainData)
summary(lm_hist)
## 
## Call:
## lm(formula = flood_risk_score ~ historical_flood_count, data = TrainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.532 -10.510  -0.895   9.469  61.145 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             31.9151     0.1099  290.49   <2e-16 ***
## historical_flood_count   6.6972     0.2016   33.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.42 on 19996 degrees of freedom
## Multiple R-squared:  0.05231,    Adjusted R-squared:  0.05226 
## F-statistic:  1104 on 1 and 19996 DF,  p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ distance_to_river_m
lm_dist <- lm(flood_risk_score ~ distance_to_river_m, data = TrainData)
summary(lm_dist)
## 
## Call:
## lm(formula = flood_risk_score ~ distance_to_river_m, data = TrainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.507 -10.631  -0.965   9.544  71.919 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.547e+01  1.470e-01  241.25   <2e-16 ***
## distance_to_river_m -1.090e-03  5.178e-05  -21.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.65 on 19996 degrees of freedom
## Multiple R-squared:  0.02167,    Adjusted R-squared:  0.02162 
## F-statistic: 442.9 on 1 and 19996 DF,  p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ drainage_index
lm_drain <- lm(flood_risk_score ~ drainage_index, data = TrainData)
summary(lm_drain)
## 
## Call:
## lm(formula = flood_risk_score ~ drainage_index, data = TrainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.178 -10.687  -0.956   9.470  67.543 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     38.1005     0.2541   149.9   <2e-16 ***
## drainage_index  -9.6346     0.4632   -20.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.65 on 19996 degrees of freedom
## Multiple R-squared:  0.02118,    Adjusted R-squared:  0.02113 
## F-statistic: 432.6 on 1 and 19996 DF,  p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ infrastructure_score
lm_infra <- lm(flood_risk_score ~ infrastructure_score, data = TrainData)
summary(lm_infra)
## 
## Call:
## lm(formula = flood_risk_score ~ infrastructure_score, data = TrainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.882 -10.657  -0.916   9.548  66.670 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          38.163818   0.262911  145.16   <2e-16 ***
## infrastructure_score -0.105187   0.005197  -20.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.66 on 19996 degrees of freedom
## Multiple R-squared:  0.02007,    Adjusted R-squared:  0.02002 
## F-statistic: 409.6 on 1 and 19996 DF,  p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ elevation_m
lm_elev <- lm(flood_risk_score ~ elevation_m, data = TrainData)
summary(lm_elev)
## 
## Call:
## lm(formula = flood_risk_score ~ elevation_m, data = TrainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.029 -10.706  -0.869   9.639  65.782 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.2338398  0.1202680  284.65   <2e-16 ***
## elevation_m -0.0051327  0.0003222  -15.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.72 on 19996 degrees of freedom
## Multiple R-squared:  0.01253,    Adjusted R-squared:  0.01248 
## F-statistic: 253.7 on 1 and 19996 DF,  p-value: < 2.2e-16

The following conclusions are drawn from the simple linear regression models fitted using the predictors that showed the strongest marginal associations with the response variable flood_risk_score:

  • monthly_rainfall_mm
    • This predictor exhibits the highest individual explanatory power, with an \(R²\) of approximately 0.59.
    • The estimated coefficient is large and highly significant (\(p < 2×10^{-16}\)) indicating a strong positive association between accumulated monthly rainfall and flood risk.
    • The residual standard error (≈ 9.47) is substantially lower than in the other models, reinforcing its dominant influence on the synthetic flood-risk score.
  • rainfall_7d_mm
    • This variable also shows a statistically significant positive relationship with flood risk (\(p < 2×10^{-16}\)), consistent with hydrological expectations.
    • Its explanatory power is more modest (\(R²\) ≈ 0.166), suggesting that short-term rainfall contributes to flood risk but less strongly than monthly accumulation.
    • The direction and magnitude of the estimated effect confirm that recent precipitation still plays a meaningful role in shaping flood-risk levels.
  • historical_flood_count
    • The estimated effect is positive and highly significant (\(p < 2×10^{-16}\)), indicating that locations with more recorded flood events tend to have higher risk scores.
    • Although the explanatory power is comparatively low (\(R²\) ≈ 0.052), this predictor remains relevant, likely capturing structural vulnerability, local geomorphology or missing hydrological processes.
  • distance_to_river_m
    • Displays a significant negative association with flood risk (\(p < 2×10^{-16}\)), consistent with hydrological reasoning: locations closer to rivers tend to exhibit higher flood risk.
    • Explanatory power is modest (\(R²\) ≈ 0.021), suggesting that while river proximity matters, its marginal effect is weaker than rainfall-driven predictors.
  • drainage_index
    • Shows a statistically significant negative effect (\(p < 2×10^{-16}\)), indicating that better drainage conditions are associated with reduced flood risk.
    • Its explanatory power is similar to distance_to_river_m (\(R²\) ≈ 0.021), reinforcing the idea that infiltration and terrain drainage capacity mitigate floodwater accumulation.
  • infrastructure_score
    • Exhibits a significant negative association with flood risk (\(p < 2×10^{-16}\)). Locations with better infrastructure tend to have slightly lower flood-risk scores.
    • The explanatory power is relatively weak (\(R²\) ≈ 0.02), indicating that structural preparedness cannot offset strong hydrometeorological drivers.
  • elevation_m
    • Shows a significant negative association (\(p < 2×10^{-16}\)), reflecting that lower-elevation areas face greater flood risk.
    • Explanatory power remains small (\(R²\) ≈ 0.012), but the effect direction aligns with hydrological behaviour, floodwaters accumulate in low-lying terrain.

Overall, these results show that monthly_rainfall_mm is by far the strongest marginal predictor of flood_risk_score, while rainfall_7d_mm and historical_flood_count provide more moderate contributions.

1.6.3 Building Multiple Linear Regression Models

We now proceed to construct a series of multiple linear regression models and to evaluate their predictive performance on the hold-out test set.

Comparing the models on both explanatory power and out-of-sample accuracy will guide the selection of a final regression specification for detailed diagnostic analysis.

We start with a full specification that includes all continuous predictors which exhibited meaningful marginal associations with the response variable flood_risk_score:

  • monthly_rainfall_mm
  • rainfall_7d_mm
  • historical_flood_count
  • distance_to_river_m
  • drainage_index
  • infrastructure_score
  • elevation_m
# Model 1: Full relevant model

lm_full_rel <- lm(
  flood_risk_score ~ monthly_rainfall_mm +
                     rainfall_7d_mm +
                     historical_flood_count +
                     distance_to_river_m +
                     drainage_index +
                     infrastructure_score +
                     elevation_m,
  data = TrainData
)

# Test-set predictions and performance
pred_full_rel <- predict(lm_full_rel, newdata = TestData)

R2_test_full_rel  <- cor(TestData$flood_risk_score, pred_full_rel)^2
RMSE_test_full_rel <- sqrt(
  mean((TestData$flood_risk_score - pred_full_rel)^2)
)

We next consider a more parsimonious specification that retains only the 3 strongest predictors identified in the exploratory and simple regression analyses. This model incorporates the two rainfall variables monthly_rainfall_mm and rainfall_7d_mm together with historical_flood_count, capturing the dominant meteorological drivers and the accumulated vulnerability of each location. By focusing on these three predictors, we aim to evaluate whether a simpler formulation can achieve competitive predictive accuracy while improving interpretability.

# Model 2: Strong rainfall + history

lm_strong3 <- lm(
  flood_risk_score ~ monthly_rainfall_mm +
                     rainfall_7d_mm +
                     historical_flood_count,
  data = TrainData
)

pred_strong3 <- predict(lm_strong3, newdata = TestData)
R2_test_strong3   <- cor(TestData$flood_risk_score, pred_strong3)^2
RMSE_test_strong3 <- sqrt(mean((TestData$flood_risk_score - pred_strong3)^2))

We also evaluate a simplified rainfall-only specification, which includes monthly_rainfall_mm and rainfall_7d_mm as the sole predictors. This model isolates the direct meteorological drivers of flood risk to assess how much predictive power can be achieved without incorporating additional environmental or structural factors.

# Model 3: Rainfall-only model

lm_rain <- lm(
  flood_risk_score ~ monthly_rainfall_mm +
                     rainfall_7d_mm,
  data = TrainData
)

pred_rain <- predict(lm_rain, newdata = TestData)
R2_test_rain   <- cor(TestData$flood_risk_score, pred_rain)^2
RMSE_test_rain <- sqrt(mean((TestData$flood_risk_score - pred_rain)^2))

We further estimate a hydro-geomorphological model that integrates key physical determinants of flood generation. In addition to the two rainfall measures, this specification includes distance_to_river_m, drainage_index and elevation_m, allowing us to capture the combined influence of precipitation, terrain structure, and hydrological context on flood risk.

# Model 4: Hydro-geomorphological model

lm_hydro <- lm(
  flood_risk_score ~ monthly_rainfall_mm +
                     rainfall_7d_mm +
                     distance_to_river_m +
                     drainage_index +
                     elevation_m,
  data = TrainData
)

pred_hydro <- predict(lm_hydro, newdata = TestData)
R2_test_hydro   <- cor(TestData$flood_risk_score, pred_hydro)^2
RMSE_test_hydro <- sqrt(mean((TestData$flood_risk_score - pred_hydro)^2))

Finally, we estimate a comprehensive model that incorporates all available predictors, including both continuous and categorical variables.

# Model 5: All predictors (continuous + categorical)

lm_all <- lm(
  flood_risk_score ~ elevation_m +
                     distance_to_river_m +
                     rainfall_7d_mm +
                     monthly_rainfall_mm +
                     drainage_index +
                     ndvi +
                     ndwi +
                     population_density_per_km2 +
                     built_up_percent +
                     infrastructure_score +
                     nearest_hospital_km +
                     nearest_evac_km +
                     historical_flood_count +
                     landcover +
                     soil_type +
                     urban_rural +
                     water_supply +
                     electricity +
                     road_quality,
  data = TrainData
)

# Test-set predictive performance
pred_all       <- predict(lm_all, newdata = TestData)
R2_test_all    <- cor(TestData$flood_risk_score, pred_all)^2
RMSE_test_all  <- sqrt(mean((TestData$flood_risk_score - pred_all)^2))

Note: Categorical predictors do not require explicit encoding when fitting classical multiple linear regression models using lm(), as R automatically applies appropriate dummy-variable coding.

We now summarise the predictive performance of all fitted multiple linear regression models by compiling their test-set \(R^2\) and \(RMSE\) values into a single comparison table.

# Collect test-set performance
mlr_perf <- tibble(
  Model = c("Full relevant",
            "Strong rainfall + history",
            "Rainfall only",
            "Hydro-geomorphological",
            "All predictors"),
  R2_test = c(R2_test_full_rel,
              R2_test_strong3,
              R2_test_rain,
              R2_test_hydro,
              R2_test_all),
  RMSE_test = c(RMSE_test_full_rel,
                RMSE_test_strong3,
                RMSE_test_rain,
                RMSE_test_hydro,
                RMSE_test_all)
)

# Print as a clean table (no tibble header)
as.data.frame(mlr_perf)
##                       Model   R2_test RMSE_test
## 1             Full relevant 0.8591315  5.506455
## 2 Strong rainfall + history 0.7930970  6.672519
## 3             Rainfall only 0.7455199  7.399476
## 4    Hydro-geomorphological 0.7995043  6.568641
## 5            All predictors 0.8634650  5.421057

The comparison of the alternative multiple linear regression models on the test set leads to the following conclusions:

  • The “All predictors” model achieves the best predictive performance, with the highest test-set \(R^2 \approx 0.863\) and the lowest RMSE (≈ 5.42). The improvement over the “Full relevant” model is modest but consistent, suggesting that adding the remaining predictors and categorical factors yields a small yet measurable gain in accuracy.

  • The “Full relevant” model (only continuous predictors with stronger marginal associations) performs almost as well (\(R^2 \approx 0.859\), RMSE ≈ 5.51). This indicates that most of the predictive power is already captured by the key environmental and hydrometeorological variables identified in the exploratory analysis.

  • Simpler specifications that rely solely on rainfall (“Rainfall only”) or rainfall plus historical events (“Strong rainfall + history”) show clearly lower \(R^2\) (between 0.74 and 0.79) and higher RMSE, confirming that incorporating additional terrain and infrastructural information substantially improves out-of-sample predictions.

  • The “Hydro-geomorphological” model, which combines rainfall with distance to rivers, drainage and elevation, performs better than the rainfall-only model but remains slightly inferior to the full continuous and all-predictor models. This suggests that, while hydro-geomorphological factors are important, they do not fully substitute the information carried by historical events or socio-environmental variables.

Overall, these results justify selecting the “All predictors” model as the final specification for diagnostic analysis, while noting that the “Full relevant” model offers a competitive and more parsimonious alternative with very similar predictive performance.

1.6.4 Model Diagnosis

We now assess whether the final selected model (“All predictors”) satisfies the classical linear regression assumptions by examining the behaviour of the residuals and identifying potential influential observations.

We begin by checking homoscedasticity, the assumption that the variance of residuals is constant across fitted values. This is inspected using a plot of standardized residuals versus fitted values, where a random cloud of points around zero would indicate no visible pattern.

To assess independence, we inspect the residuals across the observation index. A lack of structure in this plot would suggest that the residuals do not exhibit serial dependence.

par(mfrow = c(1, 2))

resid_std <- rstandard(lm_all)

# Plot 1: Residuals vs Fitted
plot(fitted(lm_all), resid_std,
     xlab = "Fitted values", 
     ylab = "Standardized residuals",
     main = "Residuals vs Fitted",
     pch = 16, col = "darkred", cex = 0.65)
abline(h = 0, col = "gray40", lty = 2, lwd = 1.3)

# Plot 2: Residuals vs Observation Index
plot(1:length(resid_std), resid_std,
     xlab = "Observation Index", 
     ylab = "Standardized residuals",
     main = "Residuals vs Obs. Index",
     pch = 16, col = "darkblue", cex = 0.65)
abline(h = 0, col = "gray40", lty = 2, lwd = 1.3)

  • The Residuals vs Fitted plot shows a broadly symmetric cloud around zero with no strong funnel shape, suggesting that variance is approximately constant across fitted values. Minor deviations are present but not severe.

  • The Residuals vs Observation Index plot does not exhibit visible time-like patterns or clustering, indicating no obvious serial dependence.

To assess whether residuals follow a normal distribution, we use a Q-Q plot. A linear pattern in the Q-Q plot support the assumption of normality.

res <- resid(lm_all)
qqnorm(res, main = "Normal Q–Q Plot of Residuals")
qqline(res, col = "blue", lwd = 2)

The Q–Q plot shows a largely linear pattern in the central region, with moderate deviations in the tails, indicating that the residuals follow an approximately normal distribution.

The Scale–Location plot is used to further assess the homoscedasticity assumption by visualising whether the spread of residuals remains constant across fitted values. A roughly horizontal pattern indicates stable variance, while systematic trends suggest heteroscedasticity.

# Compute standardized residuals
resid_std <- rstandard(lm_all)

# Plot: sqrt(|standardized residuals|) vs fitted values
plot(
  fitted(lm_all),
  sqrt(abs(resid_std)),
  xlab = "Fitted values",
  ylab = "Sqrt(|Standardized residuals|)",
  main = "Scale–Location Plot",
  pch = 16,
  col = "darkgreen",
  cex = 0.65
)

# Add a smooth trend line
lines(
  lowess(fitted(lm_all), sqrt(abs(resid_std))),
  col = "black",
  lwd = 2
)

In this case, the Scale–Location plot shows an approximately horizontal trend with no pronounced funnel shape, indicating that the variance of the residuals remains reasonably stable across fitted values. This suggests that the assumption of homoscedasticity is largely satisfied for the final model.

The Residuals vs Leverage plot is used to identify potentially influential observations that may disproportionately affect the fitted regression model. Points with high leverage and large residuals warrant closer inspection.

plot(
  lm_all,
  which = 5,
  pch = 16,
  col = "purple",
  cex = 0.7
)

We also assess influence diagnostics by visualizing standard influence measures for all observations, in order to identify potential high-leverage or influential points that may affect the stability of the fitted model.

influenceIndexPlot(lm_all)

  • The Residuals vs Leverage plot reveals a small number of observations with relatively higher leverage and residual magnitude. However, most points lie well within the Cook’s distance contours, indicating that no single observation exerts undue influence on the fitted regression coefficients.

  • The influence index plots (Cook’s distance, studentized residuals, hat values) confirm that, although a few observations warrant closer inspection, their overall impact on model stability is limited. The majority of observations exhibit low leverage and low influence.

1.6.5 Comparison with the Benchmark Model

We first define a simple benchmark (naïve) model that predicts flood_risk_scoreon the test set using only the mean value observed in the training set. This provides a baseline to judge whether our regression models deliver a meaningful improvement.

# First, we compute the benchmark prediction (in this case the mean)
bench_mean <- mean(TrainData$flood_risk_score, na.rm = TRUE)

# Then, we generate constant predictions for every test observation
pred_bench <- rep(bench_mean, nrow(TestData))
                  
# Finally we evaluate benchmark predictive performance on the test set
RMSE_test_bench <- sqrt(mean((TestData$flood_risk_score - pred_bench)^2, na.rm = TRUE))
MAE_test_bench  <- mean(abs(TestData$flood_risk_score - pred_bench), na.rm = TRUE)

cat("Benchmark mean (train):", round(bench_mean, 4), "\n")
## Benchmark mean (train): 33.2741
cat("Benchmark RMSE (test):", round(RMSE_test_bench, 4), "\n")
## Benchmark RMSE (test): 14.6682
cat("Benchmark MAE  (test):", round(MAE_test_bench, 4), "\n")
## Benchmark MAE  (test): 11.7959

Note: \(R^2\) via correlation is not defined here because pred_bench is constant.

  • The benchmark achieves a test RMSE of approximately 14.67 and a test MAE of approximately 11.80, reflecting the error incurred when ignoring all covariate information.
  • In contrast, the selected “All predictors” multiple linear regression model (lm_all) attains a substantially lower RMSE (≈ 5.42), indicating a large improvement in predictive accuracy.
  • This corresponds to a reduction of more than 63% in RMSE relative to the benchmark, demonstrating that the regression model captures meaningful structure in the data beyond the global mean.

1.7 Computing Prediction Intervals

In addition to point predictions, prediction intervals quantify the uncertainty around individual forecasts by accounting for both model uncertainty and irreducible noise in the response variable.

We compute 95% prediction intervals for the final selected linear regression model (lm_all) on the test set. These intervals provide a range within which future flood risk scores are expected to fall with a given confidence level.

# Computing prediction intervals on the test set
pred_int <- predict(
  lm_all,
  newdata = TestData,
  interval = "prediction",
  level = 0.95
)

# Storing predictions and intervals
TestData$Pred  <- pred_int[, "fit"]
TestData$Lower <- pred_int[, "lwr"]
TestData$Upper <- pred_int[, "upr"]

To visually assess the quality of the prediction intervals, we plot the observed flood risk scores together with the fitted values and their corresponding prediction bands.

TestData %>%
  arrange(Pred) %>%
  ggplot(aes(x = Pred)) +
  geom_point(aes(y = flood_risk_score),
             color = "blue", alpha = 0.5, size = 1.3) +
  geom_line(aes(y = Pred),
            color = "red", linewidth = 1) +
  geom_ribbon(aes(ymin = Lower, ymax = Upper),
              fill = "orange", alpha = 0.25) +
  labs(
    title = "Prediction Intervals vs Observed Flood Risk",
    subtitle = "Red line: Predicted values | Blue points: Observed values",
    x = "Predicted flood risk score",
    y = "Flood risk score"
  ) +
  theme_minimal()

We assess the empirical coverage by computing the proportion of test observations that fall within the predicted intervals.

# Counting observations outside the prediction intervals
outside_interval <- sum(
  TestData$flood_risk_score < TestData$Lower |
  TestData$flood_risk_score > TestData$Upper
)

# Computing coverage
total_obs <- nrow(TestData)
coverage <- round(100 * (1 - outside_interval / total_obs), 1)

cat("Percentage of observations inside the prediction intervals:", 
    coverage, "%\n")
## Percentage of observations inside the prediction intervals: 95.4 %

1.8 Application of the Tidymodels Framework

We now replicate our final linear regression model using the tidymodels framework, which standardises preprocessing (scaling + dummy encoding) and evaluates performance via cross-validation before fitting the final model and testing it on the hold-out set.

set.seed(123)

# 1) Preprocessing recipe (scale numeric + dummy categorical)

risk_recipe <- recipe(flood_risk_score ~ ., data = TrainData) %>%
  
  # Assign ID role to non-predictive variables
  update_role(
    record_id, district, place_name, latitude, longitude, generation_date,
    new_role = "id"
  ) %>%
  
  # Convert character predictors to factors
  step_string2factor(all_nominal_predictors()) %>%
  
  # Dummy encode categorical predictors
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  
  # Center and scale numeric predictors
  step_normalize(all_numeric_predictors())

# 2) Model specification (linear regression via lm)
lin_spec <- linear_reg() %>%
  set_engine("lm")

# 3) Workflow
lin_wf <- workflow() %>%
  add_recipe(risk_recipe) %>%
  add_model(lin_spec)

# 4) Cross-validation (10-fold, 5 repeats) + metrics
cv_folds <- vfold_cv(TrainData, v = 10, repeats = 5)

cv_fit <- lin_wf %>%
  fit_resamples(
    resamples = cv_folds,
    metrics   = metric_set(rmse, rsq, mae),
    control   = control_resamples(save_pred = TRUE)
  )

cv_metrics <- cv_fit %>% collect_metrics()
cv_metrics
## # A tibble: 3 × 6
##   .metric .estimator  mean     n  std_err .config        
##   <chr>   <chr>      <dbl> <int>    <dbl> <chr>          
## 1 mae     standard   4.35     50 0.0147   pre0_mod0_post0
## 2 rmse    standard   5.25     50 0.0193   pre0_mod0_post0
## 3 rsq     standard   0.906    50 0.000967 pre0_mod0_post0
# 5) Fit final model on full training set + test-set evaluation
final_fit <- lin_wf %>% fit(data = TrainData)

test_predictions <- TestData %>%
  dplyr::select(-any_of("Pred")) %>%                       # avoid name clashes
  dplyr::bind_cols(
    predict(final_fit, new_data = TestData) %>%
      dplyr::rename(Pred = .pred)
  )

# (Optional) sanity check
# names(test_predictions)

test_metrics <- test_predictions %>%
  yardstick::metrics(truth = flood_risk_score, estimate = Pred)

test_metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       5.37 
## 2 rsq     standard       0.897
## 3 mae     standard       4.38
# 6) Observed vs Predicted plot (test set)
ggplot(test_predictions, aes(x = flood_risk_score, y = Pred)) +
  geom_point(alpha = 0.5,col="blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col="red") +
  labs(
    title = "Observed vs Predicted Flood Risk (Test Set)",
    x = "Observed flood_risk_score",
    y = "Predicted flood_risk_score"
  ) +
  theme_minimal()

# 7) Benchmark comparison (mean-only baseline on test set)
bench_mean <- mean(TrainData$flood_risk_score, na.rm = TRUE)

bench_predictions <- TestData %>%
  mutate(Pred = bench_mean)

bench_metrics <- bench_predictions %>%
  metrics(truth = flood_risk_score, estimate = Pred)

bench_metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        14.7
## 2 rsq     standard        NA  
## 3 mae     standard        11.8
# Optional: print side-by-side
list(
  Test_Metrics = test_metrics,
  Benchmark_Metrics = bench_metrics
)
## $Test_Metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       5.37 
## 2 rsq     standard       0.897
## 3 mae     standard       4.38 
## 
## $Benchmark_Metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        14.7
## 2 rsq     standard        NA  
## 3 mae     standard        11.8
  • The tidymodels implementation reproduces the performance of the classical multiple linear regression model, yielding very similar test-set accuracy (RMSE ≈ 5.37, R² ≈ 0.90).
  • Cross-validation confirms that the model generalizes well, with low variability across folds and stable performance estimates.
  • Compared to the mean-only benchmark, the tidymodels linear regression achieves a substantial reduction in prediction error, confirming that the model captures meaningful structure in the data.

2 Second task: Generalized Linear Models (GLM)

2.1 Motivation for Generalized Linear Models

Classical linear regression relies on strong assumptions regarding the distribution of the response variable, most notably normality, constant variance, and linearity of the conditional mean. However, many real-world response variables violate these assumptions, particularly when the outcome is binary, categorical, or represents counts or proportions.

Generalized Linear Models (GLMs) extend the classical linear regression framework by allowing the response variable to follow a distribution from the exponential family and by introducing a link function that relates the expected value of the response to a linear predictor. This flexibility enables the modelling of non-Gaussian outcomes while retaining an interpretable regression structure.

In the context of flood risk analysis, certain variables of interest describe discrete events, such as whether a flood occurred or not, rather than continuous magnitudes. For such outcomes, linear regression is not appropriate, as it may produce predictions outside the admissible range and violate distributional assumptions. GLMs, and in particular logistic regression for binary responses, provide a principled framework to model event probabilities while ensuring valid predictions and interpretable effects of explanatory variables.

2.2 Choice of the Response Variable

In the context of Generalized Linear Models, the choice of the response variable is closely tied to the nature of the data and the type of phenomenon under study. Unlike the previous section, where a continuous flood risk index was modelled using linear regression, this part of the analysis focuses on the occurrence of flood events as a binary outcome.

Among the available variables, the response variable selected for the GLM analysis is flood_occurrence_current_event, which indicates whether a flood event occurred at a given location during the simulated current period. This variable takes two possible values (“Yes” or “No”), making it naturally suited for a binary response model.

Modelling flood occurrence rather than flood severity allows us to address a different but complementary research question: instead of estimating how severe flood risk is, we aim to quantify the probability that a flood event occurs given environmental, geographical, and socio-demographic conditions. This perspective is particularly relevant for early warning systems and risk classification tasks, where predicting the likelihood of an event is sometimes more important than predicting its magnitude.

2.3 Model Specification for the Generalized Linear Model

In this section, we specify a Generalized Linear Model to analyse the probability of flood occurrence. Given the binary nature of the response variable, flood_occurrence_current_event, we adopt a binomial GLM with a logistic link function.

Let \(Y_i\) denote the indicator of flood occurrence at location \(i\), where \(Y_i = 1\) if a flood occurred and \(Y_i = 0\) otherwise.

The model assumes that
\[ Y_i \sim \text{Bernoulli}(p_i), \] where \(p_i = P(Y_i = 1 \mid \mathbf{x}_i)\) is the conditional probability of a flood event.

The relationship between the predictors and the response is modelled through the logit link function: \[ \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}. \]

This formulation ensures that predicted probabilities lie in the interval \([0,1]\) and allows the effects of explanatory variables to be interpreted in terms of log-odds. The selected predictors could include hydrological, geographical and socio-environmental variables previously identified.

2.4 Data Import and Train-Test Split for GLM

As we explained previously, in contrast to the previous linear regression analysis, where the response variable was continuous, the Generalized Linear Model focuses on modelling the occurrence of flood events as a binary outcome. For this reason, the data partitioning strategy is adapted accordingly.

The dataset is divided into training and test sets using an 80/20 split, stratified by the binary response variable flood_occurrence_current_event.

Stratification ensures that the proportion of flood and non-flood observations is preserved across both subsets, which is essential for reliable estimation and evaluation of probabilistic classification models. This approach prevents imbalances in class frequencies that could otherwise bias model fitting and performance assessment.

set.seed(123)

# Importing data
srilanka <- read_csv("sri_lanka_flood_risk_dataset_25000.csv")

# Ensuring that the GLM response is a factor with reference level = "No"
srilanka <- srilanka %>%
  mutate(
    flood_occurrence_current_event = factor(
      flood_occurrence_current_event,
      levels = c("No", "Yes")
    )
  )


# Stratified 80/20 train–test split (stratify by the GLM response)

data_split_glm <- initial_split(
  srilanka,
  prop   = 0.8,
  strata = flood_occurrence_current_event
)

TrainData <- training(data_split_glm)
TestData  <- testing(data_split_glm)

# Balance verification

tibble(
  Set = c("Train", "Test"),
  n = c(nrow(TrainData), nrow(TestData)),
  event_rate = c(mean(TrainData$flood_occurrence_current_event == "Yes"),
                 mean(TestData$flood_occurrence_current_event == "Yes"))
)
## # A tibble: 2 × 3
##   Set       n event_rate
##   <chr> <int>      <dbl>
## 1 Train 20000     0.0973
## 2 Test   5000     0.105

All subsequent exploratory analysis and model estimation for the GLM are conducted on the training set, while the test set is reserved exclusively for out-of-sample evaluation.

2.5 Exploratory Analysis for the Generalized Linear Model

2.5.1 Response variable sanity check and class balance

Before conducting a detailed exploratory analysis of the predictors, we first assess the distribution of the response variable in the training and test sets. As shown in the previous subsection, the binary response variable flood_occurrence_current_event is correctly encoded with levels “No” and “Yes”, and the data are partitioned using a stratified 80/20 train–test split.

The resulting class proportions are well preserved across both subsets, with an event rate of approximately 9.7% in the training set and 10.5% in the test set. This confirms that the stratification procedure has successfully maintained the relative frequency of flood events.

2.5.2 Distribution of the Binary Response Variable

In this subsection, we examine the empirical distribution of the binary response variable flood_occurrence_current_event in the training set. This preliminary inspection is essential to understand the baseline frequency of flood events and to assess the degree of class imbalance, which directly impacts the interpretation and performance of probabilistic classification models.

Since logistic regression models the probability of an event occurring, an imbalanced response distribution is common and does not invalidate the analysis. However, documenting the event rate provides important context for model estimation, coefficient interpretation, and subsequent performance evaluation.

# Distribution of the binary response variable in the training set
TrainData %>%
  count(flood_occurrence_current_event) %>%
  mutate(
    proportion = n / sum(n)
  )
## # A tibble: 2 × 3
##   flood_occurrence_current_event     n proportion
##   <fct>                          <int>      <dbl>
## 1 No                             18054     0.903 
## 2 Yes                             1946     0.0973
# Bar plot of flood occurrence
ggplot(TrainData, aes(x = flood_occurrence_current_event)) +
  geom_bar(fill = "skyblue", alpha = 0.7) +
  labs(
    title = "Distribution of Flood Occurrence in the Training Set",
    x = "Flood occurrence (current event)",
    y = "Number of observations"
  ) +
  theme_minimal()

2.5.3 Distribution of Key Predictors by Flood Occurrence

To better understand the relationship between the explanatory variables and the probability of flood occurrence, we examine how different key predictors are distributed conditional on the binary response variable. Unlike the exploratory analysis conducted for linear regression, the focus here is not on linearity or normality, but on identifying systematic differences in predictor distributions between flood and non-flood observations.

Inspecting predictor behaviour across outcome classes provides valuable insight into their potential discriminative power and helps assess whether higher or lower values of a given variable are associated with an increased likelihood of flood occurrence. This analysis also supports subsequent model specification and interpretation of regression coefficients in the logistic regression framework.

We begin by comparing the distribution of elevation across flood and non-flood observations. Elevation is a fundamental geographic determinant of flood susceptibility, as lower-lying areas are more prone to water accumulation and overflow. Examining elevation conditional on flood occurrence allows us to assess whether flood events are systematically associated with lower altitudes.

Due to the highly right-skewed distribution of elevation and the presence of extreme values, we re-visualise elevation on a logarithmic scale to improve the comparison between flood and non-flood observations.

# Distribution of elevation by flood occurrence (log scale)
ggplot(
  TrainData,
  aes(x = flood_occurrence_current_event, y = elevation_m)
) +
  geom_boxplot(fill = "skyblue", alpha = 0.6) +
  scale_y_log10() +
  labs(
    title = "Elevation by Flood Occurrence (Training Set, Log Scale)",
    x = "Flood occurrence (current event)",
    y = "Elevation (meters, log scale)"
  ) +
  theme_minimal()

When visualized on a logarithmic scale, elevation exhibits a clear systematic difference between flood and non-flood observations. Flood events are associated with lower elevations on average, with the entire distribution for flood occurrences shifted downward relative to non-flood cases. Although substantial overlap remains between the two groups, the observed pattern confirms elevation as a meaningful predictor of flood occurrence.

After analyzing elevation, we now examine the distribution of the distance to the nearest river conditional on flood occurrence. By comparing the distributions of distance to river for flood and non-flood observations, we assess whether flood events tend to occur systematically closer to river networks.

# Distribution of distance to nearest river by flood occurrence
ggplot(
  TrainData,
  aes(x = flood_occurrence_current_event, y = distance_to_river_m)
) +
  geom_boxplot(fill = "skyblue", alpha = 0.6) +
  labs(
    title = "Distance to Nearest River by Flood Occurrence (Training Set)",
    x = "Flood occurrence (current event)",
    y = "Distance to nearest river (meters)"
  ) +
  theme_minimal()

Due to the strongly right-skewed nature of distances to river and the presence of extreme values, we further examine the relationship between distance to river and flood occurrence by estimating empirical flood rates across distance intervals.

TrainData %>%
  mutate(
    distance_bin = cut(
      distance_to_river_m,
      breaks = quantile(distance_to_river_m, probs = seq(0, 1, 0.1), na.rm = 
                          TRUE),
      include.lowest = TRUE
    )
  ) %>%
  group_by(distance_bin) %>%
  summarise(
    flood_rate = mean(flood_occurrence_current_event == "Yes"),
    n = n()
  ) %>%
  ggplot(aes(x = distance_bin, y = flood_rate)) +
  geom_point(color = "skyblue") +
  geom_line(group = 1, color = "skyblue") +
  labs(
    title = "Empirical Flood Probability by Distance to River (Training Set)",
    x = "Distance to nearest river (binned)",
    y = "Flood occurrence rate"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The distribution of distance to the nearest river exhibits substantial overlap between flood and non-flood observations, indicating limited marginal separation when considered in isolation. However, the analysis of empirical flood probabilities across distance intervals reveals a clear decreasing trend in flood occurrence as distance to river networks increases. Locations closest to rivers display markedly higher flood probabilities, while the event rate progressively declines for more distant areas.

These findings suggest that proximity to rivers contributes to flood risk in a probabilistic manner and supports the inclusion of distance_to_river_m as a relevant explanatory variable in the subsequent logistic regression models, despite its limited standalone discriminative power.

After analyzing topographic and hydrological proximity variables, we now examine the distribution of recent rainfall conditional on flood occurrence.

# Distribution of rainfall over the last 7 days by flood occurrence
ggplot(
  TrainData,
  aes(x = flood_occurrence_current_event, y = rainfall_7d_mm)
) +
  geom_boxplot(fill = "skyblue", alpha = 0.6) +
  labs(
    title = "Rainfall over Last 7 Days by Flood Occurrence (Training Set)",
    x = "Flood occurrence (current event)",
    y = "Rainfall over last 7 days (mm)"
  ) +
  theme_minimal()

The distribution of rainfall accumulated over the previous seven days exhibits a clear separation between flood and non-flood observations. Locations where a flood event occurred show substantially higher median rainfall levels and a generally upward-shifted distribution compared to non-flood locations.

This pattern indicates a strong positive association between recent rainfall intensity and the probability of flood occurrence. While variability and extreme values are present in both groups, flood events are concentrated at markedly higher rainfall levels, supporting the role of short-term precipitation as a key driver of flooding.

These findings provide strong empirical justification for including rainfall_7d_mm as an explanatory variable in the subsequent binomial GLM, where its effect can be formally quantified in terms of changes in flood occurrence probability.

After analysing short-term precipitation through rainfall accumulated over the previous seven days, we now examine the distribution of monthly accumulated rainfall conditional on flood occurrence.

# Distribution of monthly rainfall by flood occurrence
ggplot(
  TrainData,
  aes(x = flood_occurrence_current_event, y = monthly_rainfall_mm)
) +
  geom_boxplot(fill = "skyblue", alpha = 0.6) +
  labs(
    title = "Monthly Rainfall by Flood Occurrence (Training Set)",
    x = "Flood occurrence (current event)",
    y = "Monthly rainfall (mm)"
  ) +
  theme_minimal()

The distribution of monthly accumulated rainfall exhibits a clear separation between flood and non-flood observations. While both groups display right-skewed distributions with some extreme values, the upward shift of the entire distribution for flood observations suggests that monthly rainfall captures longer-term hydrological saturation effects that complement short-term rainfall indicators. This systematic difference supports the inclusion of monthly_rainfall_mm as a key explanatory variable in the subsequent logistic regression models.

After analysing key continuous predictors, we now examine how different categorical variables are distributed across flood and non-flood observations. For categorical predictors, the goal is not to assess linear trends, but to identify whether certain categories are disproportionately associated with flood occurrence. Such differences provide insight into potential structural or environmental risk factors and inform the interpretation of coefficients in the logistic regression model.

# Distribution of land cover by flood occurrence
ggplot(TrainData, aes(x = landcover, fill = flood_occurrence_current_event)) +
  geom_bar(position = "fill", alpha = 0.8) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Flood Occurrence by Land Cover Type (Training Set)",
    x = "Land cover type",
    y = "Proportion within land cover",
    fill = "Flood occurrence"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# Distribution of flood occurrence by urban/rural classification
ggplot(TrainData, aes(x = urban_rural, fill = flood_occurrence_current_event)) +
  geom_bar(position = "fill", alpha = 0.8) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Flood Occurrence by Urban–Rural Classification (Training Set)",
    x = "Urban / Rural",
    y = "Proportion within category",
    fill = "Flood occurrence"
  ) +
  theme_minimal()

# Distribution of flood occurrence by soil type
ggplot(TrainData, aes(x = soil_type, fill = flood_occurrence_current_event)) +
  geom_bar(position = "fill", alpha = 0.8) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Flood Occurrence by Soil Type (Training Set)",
    x = "Soil type",
    y = "Proportion within soil type",
    fill = "Flood occurrence"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

  • Regarding land cover, flood events are observed across all categories, with slightly higher proportions in land-cover types associated with reduced infiltration or proximity to water bodies, such as wetlands and urban areas. Although differences are not extreme, the non-uniform distribution suggests that land cover may contribute to flood susceptibility through surface runoff and land-use characteristics.

  • The urban–rural classification shows a marginally higher proportion of flood occurrences in urban locations compared to rural ones. This pattern is consistent with the presence of impervious surfaces and altered drainage systems in urban environments, which can increase runoff accumulation during heavy rainfall events.

  • For soil type, flood occurrence rates appear broadly similar across categories, with only mild variation. Nevertheless, soils with lower permeability, such as clay or peaty soils, exhibit slightly higher flood proportions, which aligns with hydrological expectations regarding infiltration capacity.

2.6 Fitting Binomial Generalized Linear Models

2.6.1 Baseline Model: Intercept-Only Logistic Regression

As a starting point, we fit an intercept-only binomial Generalized Linear Model to estimate the unconditional probability of flood occurrence. This baseline model does not include any explanatory variables and serves as a reference against which more complex models can be compared.

Let \(Y_i\) denote the binary indicator of flood occurrence at location \(i\), where
\(Y_i = 1\) if a flood event occurred and \(Y_i = 0\) otherwise. The model assumes that \[ Y_i \sim \text{Bernoulli}(p), \] where \(p = P(Y_i = 1)\) is constant across all observations.

Using a binomial Generalized Linear Model with a logistic link function, the model is specified as \[ \log\left(\frac{p}{1 - p}\right) = \beta_0, \] where \(\beta_0\) is the intercept term. The estimated intercept therefore corresponds to the log-odds of a flood event occurring in the training data.

By applying the inverse logit transformation, the unconditional flood probability is obtained as \[ p = \frac{\exp(\beta_0)}{1 + \exp(\beta_0)}. \]

This baseline model provides a benchmark level of predictive performance and establishes a reference point for evaluating the incremental explanatory power of introducing covariates in subsequent logistic regression models.

# Baseline intercept-only logistic regression
glm_null <- glm(
  flood_occurrence_current_event ~ 1,
  data = TrainData,
  family = binomial(link = "logit")
)

summary(glm_null)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ 1, family = binomial(link = "logit"), 
##     data = TrainData)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.22759    0.02386  -93.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764  on 19999  degrees of freedom
## Residual deviance: 12764  on 19999  degrees of freedom
## AIC: 12766
## 
## Number of Fisher Scoring iterations: 5
# Convert log-odds to probability
p_hat <- plogis(coef(glm_null))
p_hat
## (Intercept) 
##      0.0973

The intercept-only logistic regression model estimates an unconditional flood occurrence probability of approximately 9.73%, which coincides with the empirical event rate observed in the training data. This confirms the correct specification of the binomial GLM and provides a baseline level of model fit against which all subsequent models will be compared.

2.6.2 Building Simple Logistic Regression Models

We next consider a series of simple binomial Generalized Linear Models, each including a single explanatory variable. These models allow us to evaluate the individual association between each predictor and the probability of flood occurrence, while maintaining a parsimonious specification.

For each predictor, we estimate a binomial GLM with a logistic link function and compare its fit to the intercept-only baseline model. This stepwise approach provides insight into the marginal explanatory power of individual covariates and facilitates the interpretation of regression coefficients in terms of log-odds and odds ratios.

Let \(Y_i\) denote the binary indicator of flood occurrence at location \(i\), and let \(x_i\) be a single explanatory variable. The model is specified as

\[ Y_i \sim \text{Bernoulli}(p_i), \qquad \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i. \]

Here, \(\beta_1\) represents the change in the log-odds of flood occurrence associated with a one-unit increase in the predictor \(x_i\). Positive values of \(\beta_1\) indicate an increased probability of flood occurrence, while negative values indicate a reduced probability.

We begin the sequence of simple logistic regression models by considering elevation_m as the sole explanatory variable.

# Simple logistic regression with elevation only
glm_elev <- glm(
  flood_occurrence_current_event ~ elevation_m,
  data   = TrainData,
  family = binomial(link = "logit")
)

summary(glm_elev)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m, family = binomial(link = "logit"), 
##     data = TrainData)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.131e+00  2.729e-02 -78.113  < 2e-16 ***
## elevation_m -5.783e-04  9.015e-05  -6.415 1.41e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764  on 19999  degrees of freedom
## Residual deviance: 12717  on 19998  degrees of freedom
## AIC: 12721
## 
## Number of Fisher Scoring iterations: 5
# Odds ratios and 95% confidence intervals
exp(cbind(
  OR  = coef(glm_elev),
  confint(glm_elev)
))
##                    OR     2.5 %    97.5 %
## (Intercept) 0.1186688 0.1124544 0.1251503
## elevation_m 0.9994219 0.9992413 0.9995947

We can extract the following conclusions:

  • Elevation is a statistically significant predictor of flood occurrence (\(p < 10^{-9}\)), with a negative effect on the probability of flooding.

  • The estimated odds ratio for elevation_m is 0.9994 (95% CI: [0.99924, 0.99959]), indicating that higher elevations are associated with lower flood risk.

  • Although the effect per meter is small, cumulative elevation differences translate into substantial changes in flood probability.

  • The model improves upon the intercept-only baseline, as evidenced by a reduction in deviance and AIC.

  • These results are consistent with hydrological theory and confirm elevation as a relevant explanatory variable for flood occurrence.

We now fit a simple logistic regression model using distance_to_river_m as the sole explanatory variable to assess its effect on the probability of flood occurrence.

# Simple logistic regression with distance to nearest river
glm_river <- glm(
  flood_occurrence_current_event ~ distance_to_river_m,
  data   = TrainData,
  family = binomial(link = "logit")
)

summary(glm_river)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ distance_to_river_m, 
##     family = binomial(link = "logit"), data = TrainData)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.954e+00  3.396e-02  -57.53   <2e-16 ***
## distance_to_river_m -1.527e-04  1.506e-05  -10.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764  on 19999  degrees of freedom
## Residual deviance: 12644  on 19998  degrees of freedom
## AIC: 12648
## 
## Number of Fisher Scoring iterations: 5
# Odds ratios with 95% confidence intervals
exp(cbind(
  OR = coef(glm_river),
  confint(glm_river)
))
##                            OR     2.5 %    97.5 %
## (Intercept)         0.1417118 0.1325513 0.1514269
## distance_to_river_m 0.9998473 0.9998173 0.9998764

The simple logistic regression model using distance_to_river_m as the sole predictor shows a strong and statistically significant association with flood occurrence:

  • The estimated coefficient for distance_to_river_m is negative and highly significant (\(p < 2 \times 10^{-16}\)), indicating that flood probability decreases as distance from the nearest river increases.

  • The odds ratio is approximately \(0.99985\) per additional meter, implying a small effect at the unit level, but a substantial cumulative effect over hundreds or thousands of meters.

  • Compared to the intercept-only model, the residual deviance decreases markedly and the AIC is substantially lower, indicating a clear improvement in model fit.

We now consider short-term rainfall (7-day accumulation) and longer-term rainfall (monthly accumulation) as explanatory variables, fitting separate logistic regression models to assess and compare their individual contributions to the probability of flood occurrence.

# Simple logistic regression with rainfall over the last 7 days
glm_rain7 <- glm(
  flood_occurrence_current_event ~ rainfall_7d_mm,
  data   = TrainData,
  family = binomial(link = "logit")
)

summary(glm_rain7)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ rainfall_7d_mm, 
##     family = binomial(link = "logit"), data = TrainData)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.6479677  0.0528156  -69.07   <2e-16 ***
## rainfall_7d_mm  0.0141256  0.0003902   36.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764  on 19999  degrees of freedom
## Residual deviance: 11377  on 19998  degrees of freedom
## AIC: 11381
## 
## Number of Fisher Scoring iterations: 5
# Odds ratios and 95% confidence intervals
exp(cbind(
  OR  = coef(glm_rain7),
  confint(glm_rain7)
))
##                      OR      2.5 %     97.5 %
## (Intercept)    0.026044 0.02346018 0.02885738
## rainfall_7d_mm 1.014226 1.01345347 1.01500484
# Simple logistic regression with monthly rainfall
glm_rain_month <- glm(
  flood_occurrence_current_event ~ monthly_rainfall_mm,
  data   = TrainData,
  family = binomial(link = "logit")
)

summary(glm_rain_month)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ monthly_rainfall_mm, 
##     family = binomial(link = "logit"), data = TrainData)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -4.6863273  0.0711544  -65.86   <2e-16 ***
## monthly_rainfall_mm  0.0087855  0.0001996   44.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764  on 19999  degrees of freedom
## Residual deviance: 10311  on 19998  degrees of freedom
## AIC: 10315
## 
## Number of Fisher Scoring iterations: 6
# Odds ratios and 95% confidence intervals
exp(cbind(
  OR  = coef(glm_rain_month),
  confint(glm_rain_month)
))
##                              OR       2.5 %     97.5 %
## (Intercept)         0.009220489 0.008007769 0.01058419
## monthly_rainfall_mm 1.008824168 1.008432539 1.00922176

According to the results, we can deduce that:

  • Both 7-day rainfall and monthly rainfall are highly statistically significant predictors of flood occurrence (p < 2e-16), confirming the central role of precipitation in flood generation.

  • For rainfall over the last 7 days, the estimated odds ratio (≈ 1.014) indicates that each additional millimetre of recent rainfall increases the odds of a flood event by about 1.4%, holding all else constant.

  • For monthly rainfall, the odds ratio (≈ 1.009) implies a smaller but cumulative effect, where sustained higher rainfall levels increase flood risk through longer-term soil saturation and hydrological stress.

  • The monthly rainfall model achieves a substantially lower AIC (10315) than the 7-day rainfall model (11381), indicating a better overall model fit when longer-term accumulated rainfall is used as the sole predictor.

2.6.3 Building Multiple Logistic Regression Models

As a next step, we extend the simple logistic regression framework to a multivariate setting by jointly including multiple explanatory variables within the same binomial Generalized Linear Model. These multiple logistic regression models allow us to assess the conditional effect of each predictor on the probability of flood occurrence while controlling for the influence of the remaining covariates.

By considering several predictors simultaneously, this approach enables us to evaluate whether variables that are individually significant retain their explanatory power once potential confounding effects are accounted for. In addition, it allows us to examine how the magnitude and statistical significance of regression coefficients change relative to the corresponding univariate models.

Let \(Y_i\) denote the binary indicator of flood occurrence at location \(i\), and let \(\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})\) denote a vector of explanatory variables. The multiple logistic regression model is specified as \[ Y_i \sim \text{Bernoulli}(p_i), \qquad \log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}. \]

In this formulation, each coefficient \(\beta_j\) represents the change in the log-odds of flood occurrence associated with a one-unit increase in predictor \(x_{ij}\), holding all other variables constant. This multivariate perspective provides a more realistic representation of flood risk dynamics, where topographical, hydrological, and meteorological factors act jointly rather than in isolation.

We begin by jointly considering topographical and hydrological proximity factors. Specifically, we include elevation and distance to the nearest river as simultaneous explanatory variables in order to assess their combined and conditional effects on the probability of flood occurrence.

This model allows us to evaluate whether each predictor retains explanatory power after accounting for the other and to examine potential changes in magnitude or significance relative to the corresponding univariate models.

# Multiple logistic regression: elevation + distance to river
glm_geo <- glm(
  flood_occurrence_current_event ~ elevation_m + distance_to_river_m,
  data   = TrainData,
  family = binomial(link = "logit")
)

summary(glm_geo)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m + 
##     distance_to_river_m, family = binomial(link = "logit"), data = TrainData)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.857e+00  3.652e-02 -50.862  < 2e-16 ***
## elevation_m         -5.796e-04  9.040e-05  -6.411 1.45e-10 ***
## distance_to_river_m -1.528e-04  1.507e-05 -10.144  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764  on 19999  degrees of freedom
## Residual deviance: 12596  on 19997  degrees of freedom
## AIC: 12602
## 
## Number of Fisher Scoring iterations: 5
# Odds ratios and 95% confidence intervals
exp(cbind(
  OR = coef(glm_geo),
  confint(glm_geo)
))
##                            OR     2.5 %    97.5 %
## (Intercept)         0.1560772 0.1452581 0.1676152
## elevation_m         0.9994206 0.9992395 0.9995939
## distance_to_river_m 0.9998472 0.9998172 0.9998763

According to the results, we can deduce that:

  • Both elevation and distance to the nearest river remain highly statistically significant predictors of flood occurrence (p < 2e-16) when included jointly in the model, indicating that each variable contributes independent explanatory information.

  • The estimated odds ratio for elevation_m (≈ 0.99942) confirms a negative association between elevation and flood risk, implying that higher elevations are associated with lower odds of flood occurrence, even after controlling for river proximity.

  • The odds ratio for distance_to_river_m (≈ 0.99985) indicates that increasing distance from the nearest river reduces flood risk, with a small per-metre effect that accumulates substantially over hundreds or thousands of metres.

  • Compared to the corresponding single-predictor models, the estimated coefficients remain stable in magnitude and significance, suggesting limited confounding between elevation and river proximity.

  • The multivariate model achieves a lower AIC (12602) than both univariate models, indicating an improved overall model fit and supporting the joint inclusion of topographical elevation and hydrological proximity as key determinants of flood occurrence.

We now fit a multivariate binomial Generalized Linear Model that jointly incorporates topographic, hydrological proximity, and meteorological predictors. Specifically, elevation, distance to the nearest river, and short-term rainfall accumulation are included as explanatory variables.

This model allows us to evaluate the combined and independent contributions of terrain, river proximity, and recent precipitation to flood occurrence, while controlling for potential confounding effects among these key drivers.

# Multivariate logistic regression:
# elevation + distance to river + short-term rainfall
glm_env_rain <- glm(
  flood_occurrence_current_event ~
    elevation_m +
    distance_to_river_m +
    rainfall_7d_mm,
  data   = TrainData,
  family = binomial(link = "logit")
)

summary(glm_env_rain)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m + 
##     distance_to_river_m + rainfall_7d_mm, family = binomial(link = "logit"), 
##     data = TrainData)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.269e+00  5.917e-02 -55.244  < 2e-16 ***
## elevation_m         -6.455e-04  9.374e-05  -6.886 5.73e-12 ***
## distance_to_river_m -1.641e-04  1.571e-05 -10.445  < 2e-16 ***
## rainfall_7d_mm       1.432e-02  3.951e-04  36.256  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764  on 19999  degrees of freedom
## Residual deviance: 11196  on 19996  degrees of freedom
## AIC: 11204
## 
## Number of Fisher Scoring iterations: 6
# Odds ratios and 95% confidence intervals
exp(cbind(
  OR = coef(glm_env_rain),
  confint(glm_env_rain)
))
##                             OR      2.5 %     97.5 %
## (Intercept)         0.03804534 0.03385139 0.04268952
## elevation_m         0.99935472 0.99916728 0.99953469
## distance_to_river_m 0.99983595 0.99980474 0.99986631
## rainfall_7d_mm      1.01442796 1.01364577 1.01521708

According to the results, we can deduce that:

  • Although elevation, distance to river, and short-term rainfall are all individually and jointly statistically significant predictors, the combined model does not outperform the monthly rainfall–only model in terms of AIC. The substantially lower AIC obtained (10315) for the monthly rainfall model indicates that long-term accumulated precipitation alone captures most of the information relevant for explaining flood occurrence in the data.

  • This result suggests that monthly rainfall acts as a dominant explanatory variable, effectively summarizing broader hydrological conditions, while the additional topographic and proximity variables provide limited incremental improvement in model fit once monthly precipitation is accounted for.

We now consider a broader set of potential explanatory variables in order to construct more complex logistic regression models. While it is well understood that increasing the number of predictors does not necessarily lead to improved predictive performance, this exploratory step allows us to assess whether additional environmental, socio-demographic, and infrastructural variables provide complementary information beyond the core predictors identified earlier.

Note: The variable flood_risk_score is not used as a predictor in the GLM analysis because it is a derived composite index constructed from several environmental and socio-hydrological variables. Including it would introduce circularity and compromise the interpretability of the estimated effects on flood occurrence probability.

We begin by fitting a multivariate binomial Generalized Linear Model that incorporates key environmental and hydrological predictors previously identified as relevant in both the exploratory analysis and the linear regression task. This model includes terrain elevation, proximity to rivers, short and long-term rainfall accumulation, and local drainage conditions.

glm_M1_env <- glm(
  flood_occurrence_current_event ~
    elevation_m +
    distance_to_river_m +
    rainfall_7d_mm +
    monthly_rainfall_mm +
    drainage_index,
  data = TrainData,
  family = binomial(link = "logit")
)

summary(glm_M1_env)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m + 
##     distance_to_river_m + rainfall_7d_mm + monthly_rainfall_mm + 
##     drainage_index, family = binomial(link = "logit"), data = TrainData)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -6.164e+00  1.287e-01 -47.908  < 2e-16 ***
## elevation_m         -8.296e-04  1.078e-04  -7.699 1.37e-14 ***
## distance_to_river_m -2.115e-04  1.808e-05 -11.700  < 2e-16 ***
## rainfall_7d_mm       1.932e-02  5.016e-04  38.507  < 2e-16 ***
## monthly_rainfall_mm  1.093e-02  2.440e-04  44.781  < 2e-16 ***
## drainage_index      -1.309e+00  1.304e-01 -10.039  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764.4  on 19999  degrees of freedom
## Residual deviance:  8215.8  on 19994  degrees of freedom
## AIC: 8227.8
## 
## Number of Fisher Scoring iterations: 7
exp(cbind(
  OR = coef(glm_M1_env),
  confint(glm_M1_env)
))
##                              OR       2.5 %      97.5 %
## (Intercept)         0.002102986 0.001629891 0.002699191
## elevation_m         0.999170773 0.998955974 0.999378179
## distance_to_river_m 0.999788506 0.999752597 0.999823453
## rainfall_7d_mm      1.019504871 1.018510260 1.020515361
## monthly_rainfall_mm 1.010986620 1.010507903 1.011475030
## drainage_index      0.270128887 0.209078840 0.348568019

We next extend the environmental–hydrological model by incorporating remotely sensed indices related to vegetation cover and surface water presence.

glm_M2_env_veg <- glm(
  flood_occurrence_current_event ~
    elevation_m +
    distance_to_river_m +
    rainfall_7d_mm +
    monthly_rainfall_mm +
    drainage_index +
    ndvi +
    ndwi,
  data = TrainData,
  family = binomial(link = "logit")
)

summary(glm_M2_env_veg)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m + 
##     distance_to_river_m + rainfall_7d_mm + monthly_rainfall_mm + 
##     drainage_index + ndvi + ndwi, family = binomial(link = "logit"), 
##     data = TrainData)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -6.198e+00  1.305e-01 -47.502  < 2e-16 ***
## elevation_m         -8.311e-04  1.078e-04  -7.712 1.24e-14 ***
## distance_to_river_m -2.115e-04  1.808e-05 -11.696  < 2e-16 ***
## rainfall_7d_mm       1.933e-02  5.018e-04  38.511  < 2e-16 ***
## monthly_rainfall_mm  1.094e-02  2.442e-04  44.778  < 2e-16 ***
## drainage_index      -1.308e+00  1.304e-01 -10.030  < 2e-16 ***
## ndvi                 1.560e-01  1.054e-01   1.480    0.139    
## ndwi                 1.088e-01  1.093e-01   0.995    0.320    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764.4  on 19999  degrees of freedom
## Residual deviance:  8212.7  on 19992  degrees of freedom
## AIC: 8228.7
## 
## Number of Fisher Scoring iterations: 7
exp(cbind(
  OR = coef(glm_M2_env_veg),
  confint(glm_M2_env_veg)
))
##                              OR       2.5 %      97.5 %
## (Intercept)         0.002033085 0.001570075 0.002618685
## elevation_m         0.999169229 0.998954380 0.999376683
## distance_to_river_m 0.999788564 0.999752652 0.999823514
## rainfall_7d_mm      1.019513992 1.018519026 1.020524848
## monthly_rainfall_mm 1.010996688 1.010517512 1.011485583
## drainage_index      0.270286264 0.209173042 0.348815566
## ndvi                1.168883223 0.950724454 1.437229007
## ndwi                1.114903219 0.899902166 1.381181925

We now consider a comprehensive mixed-effects logistic regression model that combines continuous environmental, hydrological, and human exposure variables with categorical land-use and spatial classification factors. Land cover type, soil type, and urban–rural classification are included to capture structural differences in flood susceptibility across locations.

glm_M4_full <- glm(
  flood_occurrence_current_event ~
    elevation_m +
    distance_to_river_m +
    rainfall_7d_mm +
    monthly_rainfall_mm +
    drainage_index +
    population_density_per_km2 +
    built_up_percent +
    infrastructure_score +
    landcover +
    soil_type +
    urban_rural,
  data = TrainData,
  family = binomial(link = "logit")
)

summary(glm_M4_full)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m + 
##     distance_to_river_m + rainfall_7d_mm + monthly_rainfall_mm + 
##     drainage_index + population_density_per_km2 + built_up_percent + 
##     infrastructure_score + landcover + soil_type + urban_rural, 
##     family = binomial(link = "logit"), data = TrainData)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -5.792e+00  1.687e-01 -34.336  < 2e-16 ***
## elevation_m                -8.894e-04  1.090e-04  -8.157 3.44e-16 ***
## distance_to_river_m        -2.116e-04  1.816e-05 -11.656  < 2e-16 ***
## rainfall_7d_mm              1.951e-02  5.069e-04  38.488  < 2e-16 ***
## monthly_rainfall_mm         1.107e-02  2.475e-04  44.728  < 2e-16 ***
## drainage_index             -1.317e+00  1.313e-01 -10.029  < 2e-16 ***
## population_density_per_km2  4.004e-05  8.698e-05   0.460   0.6453    
## built_up_percent            3.868e-03  1.655e-03   2.337   0.0195 *  
## infrastructure_score       -1.328e-02  1.491e-03  -8.907  < 2e-16 ***
## landcoverBare Soil         -9.591e-02  1.467e-01  -0.654   0.5132    
## landcoverForest            -2.431e-02  9.340e-02  -0.260   0.7947    
## landcoverPlantation         7.924e-03  9.925e-02   0.080   0.9364    
## landcoverScrub             -1.723e-02  1.028e-01  -0.168   0.8670    
## landcoverUrban             -2.033e-01  1.370e-01  -1.485   0.1377    
## landcoverWetland            1.978e-01  1.057e-01   1.870   0.0615 .  
## soil_typeLoamy             -1.487e-01  9.480e-02  -1.568   0.1168    
## soil_typePeaty              1.271e-01  9.179e-02   1.384   0.1663    
## soil_typeSandy              7.102e-02  9.190e-02   0.773   0.4396    
## soil_typeSilty             -1.039e-02  9.240e-02  -0.112   0.9105    
## urban_ruralUrban            1.922e-01  1.275e-01   1.507   0.1317    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764.4  on 19999  degrees of freedom
## Residual deviance:  8100.8  on 19980  degrees of freedom
## AIC: 8140.8
## 
## Number of Fisher Scoring iterations: 7
exp(cbind(
  OR = coef(glm_M4_full),
  confint(glm_M4_full)
))
##                                     OR       2.5 %      97.5 %
## (Intercept)                0.003052187 0.002187128 0.004237125
## elevation_m                0.999111018 0.998893758 0.999320980
## distance_to_river_m        0.999788369 0.999752299 0.999823472
## rainfall_7d_mm             1.019700391 1.018695334 1.020721722
## monthly_rainfall_mm        1.011131510 1.010645972 1.011627073
## drainage_index             0.267936061 0.206992926 0.346379584
## population_density_per_km2 1.000040040 0.999868912 1.000209934
## built_up_percent           1.003875718 1.000619104 1.007134126
## infrastructure_score       0.986809240 0.983922478 0.989689626
## landcoverBare Soil         0.908541549 0.678023934 1.205517487
## landcoverForest            0.975986779 0.812007714 1.171141313
## landcoverPlantation        1.007955073 0.828762667 1.223078582
## landcoverScrub             0.982919916 0.802311663 1.200834744
## landcoverUrban             0.816018625 0.623834818 1.067247851
## landcoverWetland           1.218667392 0.989080371 1.497293738
## soil_typeLoamy             0.861842590 0.715561908 1.037730358
## soil_typePeaty             1.135476497 0.948613232 1.359543874
## soil_typeSandy             1.073607555 0.896692763 1.285707457
## soil_typeSilty             0.989663649 0.825735363 1.186262440
## urban_ruralUrban           1.211945815 0.943451069 1.555417048

Finally, we explore a targeted interaction model motivated by hydrological theory. Specifically, we include interactions between precipitation intensity and drainage capacity, as well as between short-term rainfall and river proximity. These interactions allow the effect of rainfall on flood occurrence to vary depending on local drainage conditions and hydrological connectivity.

glm_M5_interact <- glm(
  flood_occurrence_current_event ~
    elevation_m +
    distance_to_river_m +
    rainfall_7d_mm +
    monthly_rainfall_mm +
    drainage_index +
    rainfall_7d_mm:distance_to_river_m +
    monthly_rainfall_mm:drainage_index,
  data = TrainData,
  family = binomial(link = "logit")
)

summary(glm_M5_interact)
## 
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m + 
##     distance_to_river_m + rainfall_7d_mm + monthly_rainfall_mm + 
##     drainage_index + rainfall_7d_mm:distance_to_river_m + monthly_rainfall_mm:drainage_index, 
##     family = binomial(link = "logit"), data = TrainData)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -5.950e+00  2.067e-01 -28.789  < 2e-16 ***
## elevation_m                        -8.305e-04  1.077e-04  -7.709 1.27e-14 ***
## distance_to_river_m                -2.275e-04  3.794e-05  -5.998 2.00e-09 ***
## rainfall_7d_mm                      1.908e-02  6.815e-04  28.001  < 2e-16 ***
## monthly_rainfall_mm                 1.035e-02  5.272e-04  19.631  < 2e-16 ***
## drainage_index                     -1.706e+00  3.508e-01  -4.863 1.15e-06 ***
## distance_to_river_m:rainfall_7d_mm  1.292e-07  2.700e-07   0.479    0.632    
## monthly_rainfall_mm:drainage_index  1.206e-03  9.863e-04   1.222    0.222    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12764.4  on 19999  degrees of freedom
## Residual deviance:  8214.1  on 19992  degrees of freedom
## AIC: 8230.1
## 
## Number of Fisher Scoring iterations: 7
exp(cbind(
  OR = coef(glm_M5_interact),
  confint(glm_M5_interact)
))
##                                             OR       2.5 %      97.5 %
## (Intercept)                        0.002606094 0.001730671 0.003891244
## elevation_m                        0.999169852 0.998955090 0.999377234
## distance_to_river_m                0.999772493 0.999696856 0.999845552
## rainfall_7d_mm                     1.019265388 1.017912582 1.020635772
## monthly_rainfall_mm                1.010404049 1.009369514 1.011458098
## drainage_index                     0.181619432 0.091212976 0.360803667
## distance_to_river_m:rainfall_7d_mm 1.000000129 0.999999603 1.000000662
## monthly_rainfall_mm:drainage_index 1.001206397 0.999272484 1.003143930

To facilitate interpretation, we summarise the fitted models in terms of their residual deviance, Akaike Information Criterion (AIC), and the percentage improvement in deviance relative to the null (intercept-only) model. The percentage improvement quantifies how much of the unexplained variability in flood occurrence is reduced by each model compared to the baseline with no predictors.

null_dev <- glm_null$deviance

model_comparison <- tibble(
  Model = c(
    "Null (Intercept-only)",
    "Elevation",
    "Distance to river",
    "Rainfall (7-day)",
    "Rainfall (monthly)",
    "Elevation + Distance",
    "Elevation + Distance + Rainfall (7-day)",
    "Env + Hydro (Elev + Dist + Rain + Drainage)",
    "Env + Hydro + Vegetation",
    "Comprehensive model (Env + Socio + Categorical)",
    "Env + Hydro + Interactions"
  ),
  Predictors = c(
    0, 1, 1, 1, 1, 2, 3, 5, 7, 11, 7
  ),
  Residual_Deviance = c(
    glm_null$deviance,
    glm_elev$deviance,
    glm_river$deviance,
    glm_rain7$deviance,
    glm_rain_month$deviance,
    glm_geo$deviance,
    glm_env_rain$deviance,
    glm_M1_env$deviance,
    glm_M2_env_veg$deviance,
    glm_M4_full$deviance,
    glm_M5_interact$deviance
  ),
  Improvement_Percent = c(
    0,
    (null_dev - glm_elev$deviance) / null_dev * 100,
    (null_dev - glm_river$deviance) / null_dev * 100,
    (null_dev - glm_rain7$deviance) / null_dev * 100,
    (null_dev - glm_rain_month$deviance) / null_dev * 100,
    (null_dev - glm_geo$deviance) / null_dev * 100,
    (null_dev - glm_env_rain$deviance) / null_dev * 100,
    (null_dev - glm_M1_env$deviance) / null_dev * 100,
    (null_dev - glm_M2_env_veg$deviance) / null_dev * 100,
    (null_dev - glm_M4_full$deviance) / null_dev * 100,
    (null_dev - glm_M5_interact$deviance) / null_dev * 100
  ),
  AIC = c(
    AIC(glm_null),
    AIC(glm_elev),
    AIC(glm_river),
    AIC(glm_rain7),
    AIC(glm_rain_month),
    AIC(glm_geo),
    AIC(glm_env_rain),
    AIC(glm_M1_env),
    AIC(glm_M2_env_veg),
    AIC(glm_M4_full), 
    AIC(glm_M5_interact)
  )
)

model_comparison_clean <- model_comparison %>%
  mutate(
    Residual_Deviance = round(Residual_Deviance, 1),
    Improvement_Percent = round(Improvement_Percent, 1),
    AIC = round(AIC, 1)
  )

kable(
  model_comparison_clean,
  align = "lcccc"
)
Model Predictors Residual_Deviance Improvement_Percent AIC
Null (Intercept-only) 0 12764.4 0.0 12766.4
Elevation 1 12716.8 0.4 12720.8
Distance to river 1 12643.7 0.9 12647.7
Rainfall (7-day) 1 11377.0 10.9 11381.0
Rainfall (monthly) 1 10311.2 19.2 10315.2
Elevation + Distance 2 12596.1 1.3 12602.1
Elevation + Distance + Rainfall (7-day) 3 11196.1 12.3 11204.1
Env + Hydro (Elev + Dist + Rain + Drainage) 5 8215.8 35.6 8227.8
Env + Hydro + Vegetation 7 8212.7 35.7 8228.7
Comprehensive model (Env + Socio + Categorical) 11 8100.8 36.5 8140.8
Env + Hydro + Interactions 7 8214.1 35.6 8230.1

Based on the comparative analysis of model fit and complexity summarised in the table, we select two models for further interpretation and predictive assessment.

The first selected model is the Comprehensive model (Environmental + Socio-demographic + Categorical predictors). This specification achieves the lowest residual deviance (8100.8) and the largest improvement over the null model (36.5%), indicating the strongest overall explanatory power. Although it includes a relatively large number of predictors, the corresponding AIC (8140.8) is the lowest among all considered models, suggesting that the gain in goodness of fit more than compensates for the increased model complexity. This model therefore represents the best-performing specification in terms of global fit and information criteria.

As a complementary and more parsimonious alternative, we also select the Environmental + Hydrological model (Elevation, Distance to River, Rainfall, Drainage Index). This model explains a substantial proportion of the null deviance (35.6% improvement) while relying on a considerably smaller set of predictors. Its residual deviance (8215.8) and AIC (8227.8) are only marginally higher than those of more complex specifications, indicating that most of the explanatory power is already captured by key environmental and hydrological drivers.

2.7 Interpretation of our Generalized Linear Models

In this section, we will interpret one of the selected binomial Generalized Linear Model using marginal effects and odds ratios. Rather than focusing solely on raw regression coefficients, we examine how the predicted probability of flood occurrence varies with key explanatory variables, while holding the remaining covariates fixed at representative values.

Based on the model comparison conducted in Section 2.6, we focus on the Environmental + Hydrological model, which includes elevation, distance to the nearest river, short-term rainfall (7-day accumulation), monthly rainfall and drainage conditions. This model achieves a substantial improvement in goodness of fit relative to the null model while maintaining a high level of interpretability and avoiding unnecessary complexity.

# First, we re-define our model
glm_M1_env <- glm(
  flood_occurrence_current_event ~ 
    elevation_m +
    distance_to_river_m +
    rainfall_7d_mm +
    monthly_rainfall_mm +
    drainage_index,
  data   = TrainData,
  family = binomial(link = "logit")
)

# Elevation
plot(effect("elevation_m", glm_M1_env),
     main = "Effect of Elevation on Flood Probability",
     ylab = "Predicted Probability of Flood")

# Distance to river
plot(effect("distance_to_river_m", glm_M1_env),
     main = "Effect of Distance to River on Flood Probability",
     ylab = "Predicted Probability of Flood")

# 7-day rainfall
plot(effect("rainfall_7d_mm", glm_M1_env),
     main = "Effect of 7-Day Rainfall on Flood Probability",
     ylab = "Predicted Probability of Flood")

# Monthly rainfall
plot(effect("monthly_rainfall_mm", glm_M1_env),
     main = "Effect of Monthly Rainfall on Flood Probability",
     ylab = "Predicted Probability of Flood")

# Drainage index
plot(effect("drainage_index", glm_M1_env),
     main = "Effect of Drainage Index on Flood Probability",
     ylab = "Predicted Probability of Flood")

According to the plots, we can deduce that:

  • Elevation shows a clear negative marginal effect: holding all other variables constant, higher elevations are associated with a systematically lower predicted probability of flood occurrence.

  • Distance to the nearest river also exhibits a strong negative effect, with flood probability decreasing steadily as distance from river networks increases, confirming the importance of hydrological proximity.

  • Short-term rainfall (7-day accumulation) has a strong positive and nearly linear marginal effect on flood probability, indicating that recent intense precipitation sharply increases the likelihood of flooding.

  • Monthly rainfall displays a similarly strong positive effect, capturing longer-term hydrological saturation processes that substantially raise flood risk even after controlling for short-term rainfall.

  • Drainage index has a pronounced negative effect: locations with better drainage capacity (higher index values) are associated with significantly lower predicted flood probabilities.

Overall, the marginal effects plots confirm that all five predictors contribute meaningfully and in the expected directions to flood occurrence risk, with precipitation variables exerting the strongest influence, followed by drainage conditions and topographic–hydrological factors.

2.8 Out-of-Sample Prediction and Model Performance Assessment

In this final section, we evaluate the predictive behaviour of the selected Generalized Linear Models on unseen data. Using the previously defined train–test split, we assess out-of-sample predictive performance and compute prediction intervals for the estimated flood occurrence probabilities.

For each selected model, we obtain predicted probabilities on the test set, construct approximate 95% confidence intervals on the response scale, and evaluate classification performance using a probability threshold. This allows us to assess both predictive uncertainty and practical classification accuracy.

# ========================================
# Model 1: Environmental + Hydrological
# ========================================

# First, we re-define our model
glm_M1_env <- glm(
  flood_occurrence_current_event ~ 
    elevation_m +
    distance_to_river_m +
    rainfall_7d_mm +
    monthly_rainfall_mm +
    drainage_index,
  data   = TrainData,
  family = binomial(link = "logit")
)

# 1) Predicted probabilities + 95% CI (test set)

# Predict on the linear predictor (log-odds scale)
pred_link_M1 <- predict(
  glm_M1_env,
  newdata = TestData,
  type    = "link",
  se.fit  = TRUE
)

# 95% confidence intervals on the link scale
crit <- 1.96
link_fit <- pred_link_M1$fit
link_lwr <- link_fit - crit * pred_link_M1$se.fit
link_upr <- link_fit + crit * pred_link_M1$se.fit

# Transform back to probability scale
pred_M1 <- data.frame(
  lower = glm_M1_env$family$linkinv(link_lwr),
  prediction = glm_M1_env$family$linkinv(link_fit),
  upper = glm_M1_env$family$linkinv(link_upr)
)

head(pred_M1)
##         lower  prediction       upper
## 1 0.047548529 0.053294793 0.059691974
## 2 0.001365132 0.001700045 0.002116949
## 3 0.003248647 0.003863529 0.004594255
## 4 0.275066105 0.296629008 0.319138119
## 5 0.068709125 0.076237559 0.084516021
## 6 0.008825952 0.010313385 0.012048448
# 2) Classification performance (0.5 threshold)

# Predicted probabilities
prob_M1 <- predict(glm_M1_env, TestData, type = "response")

# Class prediction using 0.5 threshold
class_M1 <- ifelse(prob_M1 > 0.5, "Yes", "No")

# Accuracy
acc_M1 <- mean(class_M1 == TestData$flood_occurrence_current_event)
acc_M1
## [1] 0.9118
# ====================================================
# Model 2: Comprehensive (Env + Socio + Categorical)
# ====================================================


# First, we re-define our model
glm_M4_full <- glm(
  flood_occurrence_current_event ~
    elevation_m +
    distance_to_river_m +
    rainfall_7d_mm +
    monthly_rainfall_mm +
    drainage_index +
    population_density_per_km2 +
    built_up_percent +
    infrastructure_score +
    landcover +
    soil_type +
    urban_rural,
  data   = TrainData,
  family = binomial(link = "logit")
)

# 1) Predicted probabilities + 95% CI (test set)

pred_link_M4 <- predict(
  glm_M4_full,
  newdata = TestData,
  type    = "link",
  se.fit  = TRUE
)

link_fit <- pred_link_M4$fit
link_lwr <- link_fit - crit * pred_link_M4$se.fit
link_upr <- link_fit + crit * pred_link_M4$se.fit

pred_M4 <- data.frame(
  lower = glm_M4_full$family$linkinv(link_lwr),
  prediction = glm_M4_full$family$linkinv(link_fit),
  upper = glm_M4_full$family$linkinv(link_upr)
)

head(pred_M4)
##         lower  prediction       upper
## 1 0.062025633 0.079603292 0.101622647
## 2 0.001387318 0.001909206 0.002626903
## 3 0.002828360 0.003730895 0.004920010
## 4 0.242199582 0.283974977 0.329819833
## 5 0.041691260 0.053237594 0.067755579
## 6 0.008390032 0.010887622 0.014118124
# 2) Classification performance (0.5 threshold)

prob_M4 <- predict(glm_M4_full, TestData, type = "response")
class_M4 <- ifelse(prob_M4 > 0.5, "Yes", "No")

acc_M4 <- mean(class_M4 == TestData$flood_occurrence_current_event)
acc_M4
## [1] 0.9114

We can conclude that:

  • Both selected models produce sensible out-of-sample predicted probabilities on the test set, together with 95% confidence intervals on the response scale.
  • Using a 0.5 classification threshold, the test accuracy is very similar for both models:
    Model 1 (Env + Hydro): 0.9118
    Model 2 (Comprehensive): 0.9114
  • The difference in accuracy between the two models is negligible, and Model 1 is marginally higher despite being simpler.
  • Given the strong class imbalance (≈ 10% “Yes”), accuracy values around 0.91 should be interpreted carefully: a naïve “always predict No” rule would already achieve close to the non-event rate (≈ 0.89). Therefore, the main value of these models is better captured by probabilistic outputs (and, ideally, metrics beyond accuracy).
  • Overall, Model 1 provides a competitive out-of-sample performance while remaining more interpretable, whereas Model 2 does not yield a meaningful accuracy gain despite higher complexity.